NASPA ASSESSMENT & RETENTION CONFERENCE, JUNE 11, 2010 FORREST LANE UNIVERSITY OF NORTH TEXAS Why...

NASPA Assessment & Retention Conference, June 11, 2010

Forrest Lane, University of North Texas

Why Propensity Score Matching should be used to Assess Programmatic Effects

Forrest Lane, NASPA Assessment & Retention Conference

Contact Information

6/11/2010

Center for Interdisciplinary Research & Analysis (CIRA)

Department of Educational Psychology

University of North Texas

[email protected]


Program Outline

Assessment Practices within Post-Secondary Education

Challenges to Quasi-Experimental Evaluation and Assessment Methods

Propensity Score Matching

Heuristic Example



Educational Assessment

Increased importance of modeling university resources to institutional outcomes.

From a student development perspective, this is often done by evaluating program effects:

First-Year Programming: Orientation, New Student Camps, Freshman Seminars

Co-curricular Activities: Greek Life, Student Activities, Community Involvement

Service-Learning Initiatives

Living Learning Communities



Modeling Effects


In order to accurately assess programmatic effects, cause and effect needs to be established.

May use control/comparison groups (experimental design).

If quasi-experimental: participants have traditionally been matched on demographic or other relevant variables,

or matched on some pre-treatment outcome (examination of baseline differences).


Example from the Literature

The effects of on- and off-campus living arrangements on students’ openness to diversity were explored.

The 13 variables in the model were analyzed using path modeling.

Spurious effects were modeled on background characteristics.

Results indicated that living on campus was directly associated with significantly higher levels of openness to diversity than living off campus.

Pike, G. (2009). The differential effects of on- and off-campus living arrangements on students’ openness to diversity. Journal of Student Affairs Research & Practice, 46(4), 629-645.



Commonly Reported Limitations

“The fact that students self selected into different residential communities represents another potential limitation of the research. Females, minority students, and higher-ability students were over-represented in the research sample due to the under-representation of off-campus students. Although background differences were accounted for in the study, the possibility remains that the residence groups might have differed in ways that were not explored” (Pike, 2009, p. 639).



Empirical Problems with Self-Selection

True randomization is rarely an option in educational assessment (Luellen, Shadish, & Clark, 2005; Grunwald & Mayhew, 2008).

As a result, there is an abundance of, and often an over-reliance on, reported effects that may inadequately address the variables contributing to differences in treatment group selection.

Non-randomized groups may systematically differ from one another on any number of covariates.

This leads to biased effect sizes when interpreting treatment effects.



Experimental vs. Quasi-Experimental


In true randomization, groups can be directly compared to one another because systematic differences have been controlled through experimental design.

Probability of group membership is equal (p = .50).

In quasi-experimental designs, group differences arise from non-randomization, so the groups cannot be compared directly to one another.

Probability of group membership is not equal (p ≠ .50).
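The contrast between the two designs can be illustrated with a small simulation. This is only a sketch on made-up data: the "ability" covariate, the sample size, and the selection rule are assumptions for illustration and do not come from the presentation. The simulated program has no true effect, so any difference in group means is bias.

```python
import math
import random

random.seed(1)

def mean(xs):
    return sum(xs) / len(xs)

def naive_difference(assign):
    """Simulate 2000 students with NO true program effect and return the raw
    mean outcome difference (participants minus non-participants)."""
    treated, control = [], []
    for _ in range(2000):
        ability = random.gauss(0, 1)              # hypothetical covariate
        outcome = ability + random.gauss(0, 0.5)  # outcome depends on ability only
        (treated if assign(ability) else control).append(outcome)
    return mean(treated) - mean(control)

# Randomized design: every student has p = .50 of participating.
randomized = naive_difference(lambda a: random.random() < 0.5)

# Self-selection: higher-ability students are more likely to participate,
# so the probability of group membership differs from student to student.
self_selected = naive_difference(
    lambda a: random.random() < 1 / (1 + math.exp(-2 * a)))

print(f"randomized difference:    {randomized:.3f}")
print(f"self-selected difference: {self_selected:.3f}")
```

Under randomization the raw difference should sit near zero, while under self-selection it is clearly positive even though the program does nothing.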


The ANCOVA Problem


ANCOVA is often used to control for differences on an outcome of interest based on theoretically relevant covariates.

Controlling for covariates on an outcome is conceptually different from matching participants on their likelihood of being in a treatment group (the independent variable).

Covariates which control for outcome differences may or may not have anything to do with group membership or self-selection.


Solution to Quasi-Experimental Designs

Propensity score matching (PSM) is used to estimate the true treatment effect and to reduce group bias due to non-randomization.

Participants are matched across groups on their likelihood of group membership.

Recommended by the U.S. Department of Education as a method to improve the quality of quasi-experimental research (Glenn, 2005).

Increasingly used in medical and economic research since the mid-1980s.



Defining a Propensity Score

Defined as the conditional probability of assignment to a particular treatment or control given a set of covariates (Rosenbaum & Rubin, 1983b).

Propensity scores condense the covariates into a single scalar variable ranging from 0 to 1.

This new scalar variable can then be used to match participants across treatment groups.

Once matched, treatment effects should be more reflective of the true effect and can be interpreted analogously to those from randomized designs.
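In symbols, the definition above can be written as follows. The notation is an assumption added for illustration and does not appear on the slides: Z is the treatment indicator (1 = treatment, 0 = control) and X is the vector of observed covariates.

```latex
e(x) = \Pr(Z = 1 \mid X = x), \qquad 0 \le e(x) \le 1
```

Two participants with the same value of e(x) are treated as equally likely to have ended up in the treatment group, whatever their individual covariate values.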



Calculating Propensity Scores

The most commonly used method is logistic regression.

Other methods include classification trees or ensemble methods such as bagging, boosted regression trees, and random forest (Shadish, Luellen, & Clark, 2006).
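As a rough illustration of the logistic regression approach, the sketch below fits a logistic model by plain gradient ascent and saves the predicted probabilities as propensity scores. It uses only the Python standard library; the data, the two covariates, and the function names are hypothetical, and a statistics package would normally do this step.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def linear(w, row):
    # w[0] is the intercept; w[1:] are the covariate coefficients.
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))

def fit_logistic(X, z, lr=0.5, n_iter=2000):
    """Fit P(z = 1 | x) by batch gradient ascent on the log-likelihood.
    X: list of covariate rows; z: list of 0/1 group indicators."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for row, zi in zip(X, z):
            err = zi - sigmoid(linear(w, row))
            grad[0] += err
            for j, xj in enumerate(row):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

def propensity_scores(X, w):
    # Predicted probability of treatment for each case.
    return [sigmoid(linear(w, row)) for row in X]

# Hypothetical data: two standardized covariates, with participation
# driven by the first one only.
random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
z = [1 if random.random() < sigmoid(1.5 * x[0]) else 0 for x in X]

w = fit_logistic(X, z)
scores = propensity_scores(X, w)
```

Each entry of `scores` is one participant's estimated probability of being in the treatment group, which is the quantity later used for matching.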



PSM in the Literature

Grunwald & Mayhew (2008) examined the development of moral reasoning in young adults and demonstrated a significant reduction in the overestimation of effects.

Morgan (2001) used propensity score matching and demonstrated that the effect of private school education on math and reading achievement is actually larger than findings in non-matched samples suggest.

Similar results have been demonstrated in economics (Dehejia & Wahba, 2002), medicine (Schafer & Kang, 2008), and sociology (Morgan & Harding, 2006).



PSM in the Literature


Over 1,000 articles using propensity score matching were found in JSTOR among sociology, economics, and medical journals, yet the method remains virtually absent from educational research and assessment.

PSM in Higher Education Literature

The following reflects a search for propensity score matching techniques in the literature between 1996 and 2010.

Journal                                          Articles using PSM
Journal of College Student Development           0
Journal of Student Affairs Research & Practice   0
Journal of College & Character                   0
Journal of Higher Education                      1
Review of Higher Education                       2
Research in Higher Education                     4



Heuristic Example

College X believes participation in a Living Learning Community (LLC) contributes to better academic performance (GPA).

A sample of 30 students was collected.

Data were examined to determine whether academic performance among LLC participants was statistically and meaningfully different from that of students who did not participate in an LLC.



Pre-Matching Achievement Scores

Group             N   M     SD    t      df  p     d
Non-Participants  16  3.21  .323  1.795  28  .084  .660
LLC Participants  14  3.43  .565

[Figure: mean GPA on a 3.0 to 4.0 scale for Non-LLC (3.21) and LLC (3.43); the gap between the two means is labeled "Biased Treatment Effect".]



Propensity Score Calculation

Logistic regression was performed in SPSS 18.0 using the following covariates to predict participation in an LLC:**

In-State vs. Out-of-State; Legacy; PSAT Scores; SAT Scores; Gender

Predicted probabilities were saved in the analysis

**Covariates should be theoretically driven variables which contribute to group membership, not the outcome of interest.



Pre-Matching Propensity Scores

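Group             N   M     SD    t      df  p     d
Non-Participants  16  .380  .226  2.534  28  .017  .942
LLC Participants  14  .565  .161

[Figure: mean propensity score on a scale from 0 (unlikely to be in LLC) to 1 (likely to be in LLC) for Non-LLC (.380) and LLC (.565); the gap between the two means is labeled "Amount of Bias".]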



Propensity Score Matching

Balance groups on covariates through matching, regression adjustment, or stratification.

Stratification across quintiles is the recommended and most common method.

Shown to reduce approximately 90% of bias due to covariates (Rosenbaum & Rubin, 1983b; Rosenbaum & Rubin, 1984; Shadish, Luellen, & Clark, 2005).

Caliper matching can also substantially reduce bias (Rosenbaum & Rubin, 1985b).

A caliper of 0.25 standard deviations of the logit transformation of the propensity score can also work well to reduce bias (Stuart & Rubin, 2007, ¶4.3.3).
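The two balancing devices named above, quintile strata and a caliper of 0.25 standard deviations on the logit scale, can be sketched as below. The scores are hypothetical, the function names are made up for this example, and the rank-based way of forming quintiles is one reasonable choice rather than the rule used in the cited papers.

```python
import math
import statistics

def logit(p):
    # Log-odds transform; p must lie strictly between 0 and 1.
    return math.log(p / (1.0 - p))

def caliper_width(scores, multiplier=0.25):
    """Caliper = multiplier x SD of the logit of the propensity scores."""
    return multiplier * statistics.stdev([logit(p) for p in scores])

def quintile_strata(scores):
    """Assign each case to stratum 1..5 by rank, so the five strata are
    as close to equal in size as possible."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        strata[i] = rank * 5 // len(scores) + 1
    return strata

# Hypothetical propensity scores for ten participants.
scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.35]
width = caliper_width(scores)
strata = quintile_strata(scores)

print(f"caliper on the logit scale: {width:.3f}")
print("strata:", strata)
```

With stratification, treatment and comparison cases are compared within each stratum. With caliper matching, a pair is accepted only if the logits of their propensity scores differ by no more than `width`.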



Matching Algorithms

MatchIt in R (Ho, Imai, King, & Stuart, 2007)

PSMATCH2 algorithm in Stata (Leuven & Sianesi, 2004)

SUGI 214-26 “GREEDY” macro in SAS (D’Agostino, 1998)

SPSS algorithm (Painter, 2009)

Core code written by Raynald Levesque and adapted for use with propensity matching by John Painter, Feb 2004.

Program developed and tested with SPSS 11.5.

The procedure finds the best match for each treatment case from the control cases.

The control case is then removed and not reconsidered for subsequent matches.
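The matching behavior described above (find the best control for each treatment case, then remove that control from the pool) is a greedy nearest-neighbor match without replacement. The sketch below illustrates that idea in Python. It is not the SPSS macro itself: the function name, the case IDs, the processing order, and the optional `caliper` argument are all assumptions made for this example.

```python
def greedy_match(treated, controls, caliper=None):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, controls: dicts mapping case id -> propensity score.
    Returns a list of (treated_id, control_id) pairs. Treated cases are
    processed in the order given; a treated case with no control inside
    the caliper is left unmatched."""
    available = dict(controls)  # copy, so the caller's dict is untouched
    pairs = []
    for t_id, t_ps in treated.items():
        if not available:
            break
        # Closest remaining control on the propensity score.
        c_id = min(available, key=lambda c: abs(available[c] - t_ps))
        if caliper is not None and abs(available[c_id] - t_ps) > caliper:
            continue
        pairs.append((t_id, c_id))
        del available[c_id]  # not reconsidered for later matches
    return pairs

# Hypothetical propensity scores.
treated = {"T1": 0.62, "T2": 0.55, "T3": 0.91}
controls = {"C1": 0.60, "C2": 0.50, "C3": 0.58, "C4": 0.20}

pairs = greedy_match(treated, controls, caliper=0.10)
print(pairs)
```

Because each control is used at most once and unmatched cases are dropped, the matched sample is smaller than the original one, and the result can depend on the order in which treatment cases are processed.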



Assessing Matched Samples

Some ways of assessing balance (Rubin, 2001):

The standardized difference in the mean propensity score between the two groups should be near zero (d < .20).

The ratio of the variances of the propensity score in the two groups should be near one, preferably between 0.80 and 1.25.
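Both checks are simple to compute once the matched propensity scores are available. The sketch below uses hypothetical scores, not the presentation's data, and assumes the standardized difference is scaled by an equally weighted pooled standard deviation, which is one common convention.

```python
import statistics

def balance_check(ps_treated, ps_control):
    """Return (standardized difference, variance ratio, balanced?) for
    two lists of propensity scores."""
    mean_t = statistics.mean(ps_treated)
    mean_c = statistics.mean(ps_control)
    var_t = statistics.variance(ps_treated)  # sample variance (n - 1)
    var_c = statistics.variance(ps_control)
    # Assumption: standardize by the SD pooled with equal weights.
    pooled_sd = ((var_t + var_c) / 2) ** 0.5
    d = (mean_t - mean_c) / pooled_sd
    ratio = var_t / var_c
    balanced = abs(d) < 0.20 and 0.80 <= ratio <= 1.25
    return d, ratio, balanced

# Hypothetical matched propensity scores.
llc = [0.40, 0.50, 0.60, 0.45, 0.55]
non_llc = [0.42, 0.48, 0.62, 0.44, 0.54]

d, ratio, balanced = balance_check(llc, non_llc)
print(f"d = {d:.3f}, variance ratio = {ratio:.3f}, balanced = {balanced}")
```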



Pre-Matching Propensity Scores

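Group             N   M     SD    t      df  p     d
Non-Participants  16  .380  .226  2.534  28  .017  .942
LLC Participants  14  .565  .161

[Figure: mean propensity score on a scale from 0 (unlikely to be in LLC) to 1 (likely to be in LLC) for Non-LLC (.380) and LLC (.565); the gap between the two means is labeled "Amount of Bias".]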



Post-Matching Propensity Scores

Group             N   M     SD    t      df  p     d
Non-Participants  8   .484  .177  .092   14  .928  .047

[Figure: mean propensity score on a scale from 0 to 1 (likely to be in LLC) for Non-LLC (.487) and LLC (.476).]



Histogram of Post-Matching PS Differences



Pre-Matching Achievement Scores

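Group             N   M     SD    t      df  p     d
Non-Participants  16  3.21  .323  1.795  28  .084  .660
LLC Participants  14  3.43  .565

[Figure: mean GPA on a 3.0 to 4.0 scale for Non-LLC (3.21) and LLC (3.43); the gap between the two means is labeled "Biased Treatment Effect".]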



Post-Matching Achievement Scores

Group             N   M     SD    t      df  p     d
Non-Participants  8   3.32  .249  .816   14  .428  .384

[Figure: mean GPA on a 3.0 to 4.0 scale for Non-LLC (3.32) and LLC (3.44); the gap between the two means is labeled "True Treatment Effects".]



Limitations & Cautions

Algorithms across various platforms make different assumptions about how to treat data.

Matched data sets tend to be more homogeneous than randomized samples.

The pre-matched and post-matched sample sizes (n) will not be equal, and the reduction should be taken into account with regard to statistical power.

Propensity score matching typically requires larger sample sizes.



References

D’Agostino, R. B. (1998). Tutorial in biostatistics: Propensity score methods for bias reduction in the comparison of treatment to a non-randomized control group. Statistics in Medicine, 17, 2265-2281.

National Research Council (2000). Scientific research in education. Washington, D.C.: National Academy Press.

Glenn, D. (2005, March). New federal policy favors randomized trials in education research. The Chronicle of Higher Education, Retrieved December 5, 2009 from http://www.chronicle.com.

Grunwald, H.E. & Mayhew, M.J. (2008). The use of propensity scores in identifying a comparison group in a quasi-experimental design: Moral reasoning development as an outcome. Research in Higher Education, 49(8), 758-775.

Ho, D., Imai, K., King, G., & Stuart, E. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis, 15, 199-236.

Leuven, E., & Sianesi, B. (2004). PSMATCH2: Stata module to perform full Mahalanobis and propensity score matching, common support graphing, and covariate imbalance testing, Statistical Software Components S432001, Boston College Department of Economics.

Morgan, S. L. (2001). Counterfactuals, causal effect heterogeneity, and the Catholic school effect on learning. Sociology of Education, 74, 341–374.



Morgan, S., & Harding, D. (2006). Matching estimators of causal effects: Prospects and pitfalls in theory and practice. Sociological Methods & Research, 35(1), 3-60. doi:10.1177/0049124106289164.

Painter, J. (2009). Jordan institute for families: Virtual research community. Retrieved from http://ssw.unc.edu/VRC/Lectures/index.htm.

Pike, G. (2009). The differential effects of on- and off-campus living arrangements on students’ openness to diversity. Journal of Student Affairs Research & Practice, 46(4), 629-645.

Rosenbaum, P. R., & Rubin, D. B. (1983b). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41-55.

Rosenbaum, P. R., & Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79(387), 516-524.

Rubin, D. B. (2001). Using propensity scores to help design observational studies: application to the tobacco litigation. Health Services & Outcomes Research Methodology 2, 169–188.




Schafer, J. L., & Kang, J. (2008). Average causal effects from nonrandomized studies: A practical guide and simulated example. Psychological Methods, 13(4), 279-313. doi:10.1037/a0014268.

Schneider, B., Carnoy, M., Kilpatrick, J., Schmidt, W. H., & Shavelson, R. J. (2007). Estimating causal effects using experimental and observational designs (report from the Governing Board of the American Educational Research Association Grants Program). Washington, DC: American Educational Research Association.

Shadish W. R., Luellen J. K., & Clark M. H. (2005). Propensity scores: An introduction and experimental test. Evaluation Review, 29(6), 530-558. doi:10.1177/0193841X0575596.

Shadish W. R., Luellen J. K., & Clark M. H. (2006). Propensity scores and quasi-experiments: A testimony to the practical side of Lee Sechrest. In: Bootzin R.R., McKnight P.E. (Eds.), Strengthening research methodology: Psychological measurement and evaluation. American Psychological Association: Washington, DC, 143–157.

Stuart, E. A., & Rubin, D. B. (2008). Matching methods for causal inference: Designing observational studies. In: Osborne, J. (Ed.), Best practices in quantitative methods. Thousand Oaks, CA: Sage Publishing.
