Graduate School of Education · Graduate School of Education NYC Merit Pay Studies • Fryer (NYC)...


11/8/2013


Graduate School of Education

Thoughts on Teacher Compensation, Evaluation & Measuring Teacher Effectiveness

Bruce D. Baker

Rutgers University

Graduate School of Education

Part I Compensation/Work Policies

• Compensation Levels & Structures

– Degree Levels & Experience

– “Merit” Pay (paying “great” teachers more)

• Teacher Assignment (to schools)

– “Mutual Consent” & Seniority Provisions

• Teacher Dismissal

– LIFO vs. “Quality Based” Layoffs?


Payscales & Parameters

• Current Policy Emphasis

– We must pay the “great” teachers more, and everyone else, not so much.

– Removal of any pay increments for:

• Experience

• Degree Level

– Removal of any preferences related to seniority/experience

– Replacement of these measures with quantifiable, centralized rating & ranking schemes. 


Misrepresenting the “research” on Graduate Degrees & Credentials

• “For example, it’s clear from abundant research that paying teachers only on the basis of their degrees and years of experience is not in the best interest of students or teachers. As the National Council on Teacher Quality, a research and policy organization whose board of directors I chaired for several years, put it, ‘the evidence is conclusive that master’s degrees do not make teachers more effective.’” – Andrew Rotherham in Time Magazine

• Logical conclusion: we should prohibit outright any compensation based on master’s degrees or on experience, and should use the savings (from taking away all of that money from those who already have it?) to pay for easily measurable teaching quality! (An obvious revenue‐neutral path to dramatic improvements!)


The question not asked

• What has been studied (the source of this “conclusive” evidence)

– Do teachers who hold general master’s degrees, versus those who do not, scattered across a variety of settings, show differences in the average tested outcome gains of their students? 

– Do teachers at varied levels of experience, scattered across a variety of settings, show differences in the average outcome gains of their students? 

– Mostly studies from the late 1990s involving data on the 8th grade cohort of 1988.

• These studies actually found positive results regarding content area degrees in math.  

• What we do not know!

– Studies of the association between experience level and student achievement gains, or between holding a master’s degree and student achievement gains, have never asked about the potential labor market consequences of no longer providing additional compensation to teachers who choose to further their education (even if only for personal interest), or of no longer guaranteeing that a teacher’s compensation will grow at a predictable rate throughout the teacher’s career.

– The adverse labor market effects may be particularly strong if we replace predictable salary increments (however frustrating) with very noisy performance measures that are largely outside teachers’ control.


Practical Responses

• Degrees & credentials are rewarded in compensation structures for a variety of reasons, including but not limited to:

– The desire to have teachers receive specialized training and obtain additional credentials to serve special student populations and/or be more flexible team players (able to teach additional content areas/grade levels)

• These additional degrees may be highly unlikely to yield immediate math test score gains, but that doesn’t mean they are meaningless/useless

– The desire to employ and retain teachers interested in advancing their own professional learning, whether related to the generation of immediate test score gains or not. 


Where are those masters degrees?


Who’s giving the biggest bump?


NOTE – It’s not as if teachers all of a sudden become ineffective when they gain more experience!

Might just want to retain these teachers!? 


From NJ Private School Web Sites

The Pingry School has more than 200 full‐ and part‐time employees. Approximately two‐thirds of the faculty hold master’s degrees, and 16 faculty members hold doctorates. Average tenure is 13 years.

http://www.pingry.org/page.cfm?p=285

Faculty

80% of the faculty have advanced degrees; 5 have earned doctorate degrees

On average, the faculty has 22 years of teaching experience

http://www.newarka.edu/quickfacts

Small classes are a hallmark of the MKA experience, and individual attention is guaranteed from a skilled and passionate faculty who are at the heart of the school's success. Thanks to a rigorous hiring process, periodic performance reviews and a commitment to innovative, ongoing faculty development, MKA teachers offer more than impressive credentials; they are experts in their fields who share and instill their love of learning, who coach, advise, nurture and inspire. 

FACULTY

Full Time: 104; Teaching Fellows, 6; Part Time: 4

Degrees: Doctorates, 11; Masters, 88; Bachelors, 15

Student/Faculty Ratio: 7:1

Average Class Size: 14 students

Average Total Students per Master Per Term: 40

http://www.lawrenceville.org/data/files/gallery/ContentGallery/Information_Sheet_2013.pdf


Returns to Experience/Age for Teachers and Non‐Teachers (at fixed degree level, location)

[Figure: Income from Wage (Salary), $40,000 to $120,000, by Age/Experience, 23 to 58 (23 = Year 1). Series: Public School Increase; Non-teacher Increase.]

Data Sources: Non‐Teacher Wages from US Census 2000, American Community Survey 2005 ‐ 2008 based on regression model of wages controlling for age, location, degree level and year. Teacher wages based on NJDOE Personnel Files also using regression model controlling for experience, degree level, location, position type and year.


What Does the Research Most Consistently Say over the Long Haul? 

• Teacher wages (relative wage) matter!

– Among/between local public districts, the relative wage of teachers with X,Y,Z credentials to their peers in other districts in the same labor market.

– Over the longer haul, the relative expected short and longer term wages and benefits of pursuing teaching as a career compared to other available alternatives, usually within the same region. 


What do we know about Average Teacher Quality & Salaries?

• Murnane and Olsen (1989) find that salaries affect the decision to enter teaching and the duration of the teaching career.[1]

• Figlio (1997, 2002) and Ferguson (1991) find that higher salaries are associated with better qualified teachers[2]

• Loeb and Page (1998, 2000) find that raising teacher wages by ten percent reduces high school dropout rates by between three and six percent and increases college enrollment rates by two percent.[3]

[1] Richard J. Murnane and Randall Olsen (1989) The effects of salaries and opportunity costs on length of stay in teaching: Evidence from Michigan. Review of Economics and Statistics 71 (2) 347‐352

[2] David N. Figlio (1997) Teacher Salaries and Teacher Quality. Economics Letters 55 267‐271. David N. Figlio (2002) Can Public Schools Buy Better‐Qualified Teachers? Industrial and Labor Relations Review 55, 686‐699. Ronald Ferguson (1991) Paying for Public Education: New Evidence on How and Why Money Matters. Harvard Journal on Legislation. 28 (2) 465‐498. 

[3] Susanna Loeb and Marianne Page (2000) Examining the link between teacher wages and student outcomes: the importance of alternative labor market opportunities and non‐pecuniary variation. Review of Economics and Statistics 82, 393‐408. Susanna Loeb and Marianne Page (1998) Examining the link between wages and quality in the teacher workforce. Department of Economics, University of California, Davis. 


What do we know about Average Teacher Quality & Salaries?

• David Figlio and Kim Rueben (2001) note that, “Using data from the National Center for Education Statistics we find that tax limits systematically reduce the average quality of education majors, as well as new public school teachers in states that have passed these limits.”

– Figlio, D.N., Rueben, K. (2001) Tax Limits and the Qualifications of New Teachers. Journal of Public Economics. April, 49‐71

• Ondrich, Pas and Yinger (2008) “find that teachers in districts with higher salaries relative to non‐teaching salaries in the same county are less likely to leave teaching and that a teacher is less likely to change districts when he or she teaches in a district near the top of the teacher salary distribution in that county.”

– Ondrich, J., Pas, E., Yinger, J. (2008) The Determinants of Teacher Attrition in Upstate New York. Public Finance Review 36 (1) 112‐144


About merit pay studies…

• For recent studies specifically on the topic of “merit pay,” each of which generally finds no positive effects of merit pay on student outcomes, see:

– Glazerman, S., Seifullah, A. (2010) An Evaluation of the Teacher Advancement Program in Chicago: Year Two Impact Report. Mathematica Policy Research Institute. 6319‐520

– Springer, M.G., Ballou, D., Hamilton, L., Le, V., Lockwood, J.R., McCaffrey, D., Pepper, M., and Stecher, B. (2010). Teacher Pay for Performance: Experimental Evidence from the Project on Incentives in Teaching. Nashville, TN: National Center on Performance Incentives at Vanderbilt University.

– Marsh, J. A., Springer, M. G., McCaffrey, D. F., Yuan, K., Epstein, S., Koppich, J., Kalra, N., DiMartino, C., & Peng, A. (2011). A Big Apple for Educators: New York City’s Experiment with Schoolwide Performance Bonuses. Final Evaluation Report. RAND Corporation & Vanderbilt University.

– Fryer, R. G. (2011). Teacher incentives and student achievement: Evidence from New York City Public Schools (NBER Working Paper No. 16850). Cambridge, MA: National Bureau of Economic Research. 

– Goodman, S. F., & Turner, L. J. (2010). Teacher incentive pay and educational outcomes: Evidence from the New York City Bonus Program. New York: Columbia University. 

• Fryer’s “Loss Aversion” study… (now the go‐to study for those pitching merit pay)

NYC Merit Pay Studies

• Fryer (NYC)

– Study authors reported that the bonus program had statistically significant negative impacts on middle school achievement in math (author‐reported effect size of –0.05) and English language arts (effect size of –0.03). In addition, the authors reported a statistically significant difference of –4.4 percentage points in high school graduation rates, reflecting lower graduation rates among students in intervention schools. 

– The study found that the teacher performance bonus program had no statistically significant impacts on elementary school achievement or teacher retention. 

• http://ies.ed.gov/ncee/wwc/pdf/single_study_reviews/wwc_fryer_091713.pdf

• Goodman

– The study found that the offer of a schoolwide teacher performance bonus program did not have a statistically significant effect on students’ reading achievement in either 2007–08 or 2008–09 or on mathematics achievement in 2007–08. For 2008–09, study authors reported a very small, but statistically significant, negative effect of the bonus program on mathematics achievement. 

• http://ies.ed.gov/ncee/wwc/pdf/single_study_reviews/wwc_nyc_bonus_101612.pdf

• RAND

– The study found that the New York City Schoolwide Performance Bonus Program had no discernible impact on school Progress Report scores.

• http://ies.ed.gov/ncee/wwc/pdf/single_study_reviews/wwc_rand_091713.pdf


About that new study which proved that DC’s IMPACT program “works”

• It didn’t!

– It showed that if we take a group of otherwise similar teachers, and arbitrarily tell some they are okay, and others they are bad and their job’s on the line, the latter group is more likely to seek other employment (or leave)

– It showed that if we take a group of otherwise similar teachers, and arbitrarily exclude some from the top performance label (and salary and bonuses that go with it), they will try to exhibit observed behaviors that will move them into that category. 

See: http://schoolfinance101.wordpress.com/2013/10/17/more‐thoughts‐on‐interpreting‐educationaleconomic‐research‐dc‐impact‐study/

Dee, T., & Wyckoff, J. (2013). INCENTIVES, SELECTION, AND TEACHER PERFORMANCE: EVIDENCE FROM IMPACT.


Established, experienced charter operators understand what brings in & keeps their best


Established, experienced charter operators understand what brings in & keeps their best


Teacher Assignment

• Some (NCTQ) believe that inequities in teacher quality are largely a within‐district problem, where poor schools in districts get the weakest teachers and rich schools in the same district get the best teachers.

– While such problems do exist, this argument ignores the reality that most districts don’t have rich schools and poor schools. Some have poor and very poor schools, and others have rich and very rich schools. 

• Therefore, the primary fix for TQ disparities lies in district level teacher contracts and with provisions like “seniority bumping.”

– The fix, in their view, is to decentralize hiring to principals, eliminate seniority preferences and include “mutual consent” language in contracts. 

– They also believe this fix can be imposed through federal Title I restriction.


Do Seniority Provisions Actually Exacerbate Within District Inequity?

• Gross & Goldhaber 2010

– We conduct an interrupted time‐series analysis of data from 1998‐2005 and find that the shift from a seniority‐based hiring system to a “mutual consent” hiring system leads to an initial increase in both teacher turnover and share of inexperienced teachers, especially in the district’s most disadvantaged schools. For the most part, however, these initial shocks are corrected within four years leaving little change in the distribution of inexperienced teachers or levels of turnover across schools of different advantage.

– http://www.nctq.org/docs/Mutual_Concent_8049.pdf

• Cohen‐Vogel & colleagues (2013)

– Using data from Florida, the authors analyze whether and how CBAs influence the distribution of teacher quality within school districts, paying special attention to staffing rules that grant preferences to senior teachers. They find little evidence that the within‐district variation in teacher quality between more and less disadvantaged schools in Florida is explained by the determinativeness of union contract rules. 

– http://epa.sagepub.com/content/35/3/324.abstract


Practical Response

• “Mutual Consent” Policies…

– Assume that bad decisions regarding placement are only ever made by central office and that good ones made by school principals

– Ignore that principals themselves may be inequitably distributed (by their capacity to make good personnel decisions)

– Ignore that central office actually controls the placement of principals. 


“Quality Based Layoffs” vs. LIFO

• The talking point…

– When forced to make tough layoff decisions, doesn’t it make more sense to lay off “ineffective” teachers and keep “great” ones, rather than being forced to lay off our new, energetic, potentially great future teachers? 

• Sounds good but…

– It assumes RIF (reduction in force) to be a sufficiently common event to have repeated, ongoing effect on quality of teacher workforce

– It assumes we have a sufficiently accurate measure of teacher “greatness”

• That we have that measure, consistently and comparably across all teachers, or at least those we might be dismissing

– It assumes the distribution of dismissals would be substantively different than if dismissals were based on seniority

– And thus assumes a cost saving margin that might not (will not) be realized. 


About those simulations that Project Great Results

• They assume, wrongly, that if a 5% across the board cut was imposed, the cut would yield a 5% across the board reduction of payroll for core content teachers in tested grade levels.

– In most cases of which I’m aware, budget reductions do not occur first, or uniformly to core classroom teachers for whom “greatness” measures are available. Other stuff gets cut first. 

• They rely on the highly suspect assumption that the replacement pool (down the line) will be equal in quality to the average current teacher and/or that those reshuffled will remain equally “effective” by the chosen measures.

– When the district starts laying off teachers based on suspect measures/judgments.

• Even then the margins are slim…. 


Playing it out… 

The tradeoff being made in this case is NOT, as some would argue, a tradeoff between “keeping quality teachers” and “keeping old, dead wood,” but rather a tradeoff between laying off teachers on the unfortunately crude basis of seniority only and laying off teachers on a marginally‐better‐than‐random, roll‐of‐the‐dice basis. I would argue the latter may actually be more problematic for the future quality of the teaching workforce! Yes, pundits seem to think that destabilizing the teaching workforce can only make it better. How could it possibly get worse, they argue? Substantially increasing the uncertainty of career earnings for teachers can certainly make it worse.

Part II Veracity of the Alternative Measures

• Estimating teacher effects

– Basic attribution problems

– Stability Issues

• Decomposing the Signal and the Noise

• SGP and VAM

– Attribution?

– False signals in NJ SGP data

• The Toxic Trifecta of Eval Policies

• Debunking Disinformation


Basic Assumption

Teacher Effectiveness = Student Test Scores After ‐ Student Test Scores Before

(“controlling” for a variety of conditions… or well… not)
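
As a minimal sketch of the basic assumption above, the naive version is just an average gain per classroom. The function name and the scores are hypothetical, chosen only for illustration. Nothing here adjusts for student or classroom conditions, which is the point of the parenthetical.

```python
from statistics import mean

def naive_teacher_effect(students):
    """Average of (score after - score before) over one teacher's students."""
    return mean(s["after"] - s["before"] for s in students)

# Hypothetical scale scores for three students in one classroom.
classroom = [
    {"before": 200, "after": 212},
    {"before": 215, "after": 221},
    {"before": 190, "after": 205},
]

effect = naive_teacher_effect(classroom)
print(effect)
```

Everything that follows in Part II is about what this simple difference leaves out.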


Temporal Issues & “Treatment” Effect

Determining “before” and “after”

[Timeline figure: June (3rd Grade Test), Sept. (4th Grade (Fall) Test), June (4th Grade (Spring) Test)]

But many VAMs don’t consider variations in summer learning/lag


Put simply…

• Test score data from APRIL to APRIL are used to evaluate the “effectiveness” of the teacher to whom a child is assigned for an HOUR PER DAY, from SEPTEMBER to JUNE.

– A lot goes on over the summer, and that varies across kids (by wealth/poverty, etc.)

• Academic summer programs

• Reading/resources at home

• Kumon? Other private investment in education

– A lot goes on outside of that hour per day

– Who’s responsible for MAY & JUNE of the prior year?
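
A rough sketch of the timing problem, assuming the April‐to‐April gain can be split into three additive pieces. All numbers and names are hypothetical. The only point is that two students with identical learning under the current teacher can show different annual gains.

```python
def april_to_april_gain(prior_may_june, summer, sept_to_april):
    """Total measured gain between two April tests.

    prior_may_june: growth under LAST year's teacher after the prior test
    summer:         change over the summer (can be negative)
    sept_to_april:  growth under the teacher being evaluated
    """
    return prior_may_june + summer + sept_to_april

# Two hypothetical students who learn the same amount with the current teacher
# but have different summers.
student_a = april_to_april_gain(prior_may_june=2, summer=-3, sept_to_april=10)
student_b = april_to_april_gain(prior_may_june=2, summer=3, sept_to_april=10)

print(student_a, student_b)
```

In this made‐up case the whole difference between the two measured gains comes from the summer term, yet a gain‐based rating would credit or charge it to the current teacher.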


Summer Learning Lag by Poverty

http://epa.sagepub.com/content/23/2/171.abstract

[Figure panels: Annual Growth; School‐year Growth]


Influence of Seasonality?

“The norm sample results imply that students improve their reading comprehension scores just as much (or more) between April and October as between October and April in the following grade.” (Gates MET)

http://www.metproject.org/downloads/Preliminary_Findings‐Research_Paper.pdf

Notes on Stability of Ratings

• ONLY “About one quarter to one third of the teachers in the bottom and top quintiles stay in the same quintile from one year to the next while roughly 10 to 15 percent of teachers move all the way from the bottom quintile to the top and an equal proportion fall from the top quintile to the lowest quintile in the next year.” (p. 2)[1]

[1] Sass, T.R. (2008) The Stability of Value‐Added Measures of Teacher Quality and Implications for Teacher Compensation Policy. Urban Institute, http://www.urban.org/UploadedPDF/1001266_stabilityofvalue.pdf

See also: McCaffrey, Daniel F.; Tim R. Sass; J. R. Lockwood and Kata Mihaly. 2009. "The Intertemporal Variability of Teacher Effect Estimates." Education Finance and Policy, 4(4), pp. 572‐606.


Notes on Misidentification

Due to “random error”

• There is about a 25% chance if using three years of data, or a 35% chance if using one year of data, that a teacher who is “average” would be identified as “significantly worse than average” and potentially be fired.

• Of particular concern is the likelihood that a “good teacher” is falsely identified as a “bad” teacher, in this case a “false positive” identification.

– According to the study, this occurs 1 in 10 times given three years of data, and 2 in 10 times given only one year.

– Also problematic from a policy perspective, but perhaps less so from a legal perspective because it results in improper retention rather than improper dismissal, is the equal likelihood of a “false negative” error, in which a “bad teacher” is improperly identified as a “good” one.

Schochet, Peter Z. and Hanley S. Chiang (2010). Error Rates in Measuring Teacher and School Performance Based on Student Test Score Gains (NCEE 2010‐4004). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.


Different Tests Yield Different Ratings!

• Sean Corcoran (2010) explains that “Houston has administered two standardized tests every year: the state TAKS and the nationally normed Stanford Achievement Test.” 

• “among those who ranked in the top category (5) on the TAKS reading test, more than 17 percent ranked among the lowest two categories on the Stanford test. Similarly, more than 15 percent of the lowest value‐added teachers on the TAKS were in the highest two categories on the Stanford.” 

Corcoran, Sean P., Jennifer L. Jennings, and Andrew A. Beveridge. 2010. “Teacher Effectiveness on High‐ and Low‐Stakes Tests.” Paper presented at the Institute for Research on Poverty summer workshop, Madison, WI. 


Related Gates Finding

• the correlation between a teacher’s value‐added on the state test and their value‐added on the Balanced Assessment in Math was .377 in the same section and .161 between sections. 

• we estimate the correlation between the persistent component of teacher impacts on the state test and on BAM is moderately large, .54. 

• The correlation in the stable teacher component of ELA value‐added and the Stanford 9 OE was lower, .37. 

http://www.metproject.org/downloads/Preliminary_Findings‐Research_Paper.pdf


Gates Findings


Test                 DIFFERENT SECTION                    PRIOR YEAR
                     Total      Correlation  Stable       Total      Correlation  Stable
                     Variance                Component    Variance                Component
State Math Test      0.053      0.380        0.020        0.040      0.404        0.016
                     [0.231]                 [0.143]      [0.20]                  [0.127]
State ELA Test       0.032      0.179        0.006        0.028      0.195        0.005
                     [0.178]                 [0.075]      [0.166]                 [0.073]
BAM Test             0.071      0.227        0.016
                     [0.266]                 [0.127]
Stanford 9 OE ELA    0.129      0.348        0.045
                     [0.359]                 [0.212]

http://www.metproject.org/downloads/Preliminary_Findings‐Research_Paper.pdf
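
The arithmetic in the table can be checked directly. Reading the reported numbers, the stable component appears to be the total variance multiplied by the correlation, and the bracketed values appear to be the corresponding standard deviations (square roots). That reading is an inference from the figures, not something stated on the slide. A short check on the “different section” columns, with a hypothetical helper name:

```python
# (total variance, correlation, reported stable component) as printed in the table
rows = {
    "State Math Test": (0.053, 0.380, 0.020),
    "State ELA Test": (0.032, 0.179, 0.006),
    "BAM Test": (0.071, 0.227, 0.016),
    "Stanford 9 OE ELA": (0.129, 0.348, 0.045),
}

def stable_component(total_variance, correlation):
    """Assumed relation: variance that persists = total variance x correlation."""
    return total_variance * correlation

computed = {
    test: round(stable_component(var, corr), 3)
    for test, (var, corr, _reported) in rows.items()
}

for test, (_var, _corr, reported) in rows.items():
    print(f"{test}: computed {computed[test]:.3f}, reported {reported:.3f}")
```

Put differently, for the state ELA test less than a fifth of the variance in estimated teacher effects carries over from one section to another.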


A little stat‐geeking

• What’s in a growth or VAM estimate?

– The largest part is random noise… that is, if we look from year to year, across the same teachers, estimates jump around a lot, or vary a lot in “unexplained” and seemingly unpredictable ways.

– The other two parts are:

• False Signal, or predictable patterns that are predictable not as a function of anything the teacher is doing, but a function of other stuff outside the teacher’s control, that happens to have predictable influence 

– Student sort, classroom conditions, summer experiences, test form/scale and starting position of students on that scale. 

• True Signal, or that piece of the predictability of change in test score from time 1 to time 2 that might fairly be attributed to the role of the teacher in the classroom. 
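
To make the decomposition concrete, here is a minimal sketch under a simplifying assumption that is not from the slides: each year's estimate is a persistent component plus independent noise, with the same variances every year. Under that assumption the year‐to‐year correlation equals the persistent share of total variance, and that persistent share still lumps true signal together with false signal. The function names are hypothetical. The correlations are the NYC year‐to‐year figures reported later in this deck (.327 for ELA, .5046 for math).

```python
def year_to_year_correlation(persistent_var, noise_var):
    """Correlation of two yearly estimates that share a persistent component."""
    return persistent_var / (persistent_var + noise_var)

def noise_share(correlation):
    """Share of total variance that is year-specific noise, given the correlation."""
    return 1.0 - correlation

# Sanity check of the identity: 1 part persistent, 2 parts noise gives r = 1/3.
assert abs(year_to_year_correlation(1.0, 2.0) - 1 / 3) < 1e-12

for subject, r in {"ELA": 0.327, "Math": 0.5046}.items():
    print(f"{subject}: persistent share {r:.1%}, noise share {noise_share(r):.1%}")
```

Under this reading roughly two thirds of the ELA variation and about half of the math variation is noise, before asking how much of the remainder is false signal.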


Distilling Signal from Noise

• Total Variation

– Unknown & Seemingly Unpredictable Error (Random)

– Predictable Variation (Persistent Effect)

• Attributable to Teacher? 

• Attributable to other Persistent Attributes? 

• Difficult if not implausible to accurately parse


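[Figure: scatterplot of teacher Value Added 2009-10 (vertical axis, -1 to 2) against Value Added 2008-09 (horizontal axis, -.4 to .6), English Language Arts Grades 4 to 8. Correlation=.327. Legend: Other; Good to Bad; Bad to Good; Average. Regions labeled Bad and Good.]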

9 to 15% (of those who were “good” or were “bad” in the previous year) move all the way from good to bad or bad to good. 20 to 35% who were “bad” stayed “bad” & 20 to 35% who were “good” stayed “good.” And this is between the two years that show the highest correlation for ELA.

[Figure: scatterplot of teacher Value Added 2009-10 (vertical axis, -1 to 1.5) against Value Added 2008-09 (horizontal axis, -.5 to 1.5), Mathematics Grades 4 to 8. Correlation=.5046. Legend: Other; Good to Bad; Bad to Good; Average. Regions labeled Bad and Good.]

For math, only about 7% of teachers jump all the way from being bad to good or good to bad (of those who were “good” or “bad” the previous year), and about 30 to 50% who were good remain good, or who were bad, remain bad.
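
To see how much reshuffling a correlation of this size implies, the sketch below simulates two years of ratings. It assumes bivariate normal ratings with the reported correlations and defines “good” and “bad” as the top and bottom quintiles. Both are assumptions for illustration, not the NYC data or the slide's exact categories, and the function name is hypothetical.

```python
import random

def quintile_transitions(corr, n_teachers=100_000, seed=0):
    """Simulate two years of ratings with a given year-to-year correlation.

    Returns (share of year-1 top-quintile teachers still in the top quintile,
             share of year-1 bottom-quintile teachers who reach the top quintile).
    """
    rng = random.Random(seed)
    year1, year2 = [], []
    for _ in range(n_teachers):
        x = rng.gauss(0, 1)
        # y has unit variance and correlation `corr` with x
        y = corr * x + (1 - corr ** 2) ** 0.5 * rng.gauss(0, 1)
        year1.append(x)
        year2.append(y)

    def quintile_cutoffs(values):
        ordered = sorted(values)
        return ordered[n_teachers // 5], ordered[4 * n_teachers // 5]

    low1, high1 = quintile_cutoffs(year1)
    _low2, high2 = quintile_cutoffs(year2)

    top = [y for x, y in zip(year1, year2) if x >= high1]
    bottom = [y for x, y in zip(year1, year2) if x < low1]
    stay_top = sum(y >= high2 for y in top) / len(top)
    bottom_to_top = sum(y >= high2 for y in bottom) / len(bottom)
    return stay_top, bottom_to_top

ela_stay, ela_flip = quintile_transitions(0.327)
math_stay, math_flip = quintile_transitions(0.5046)
print(f"ELA:  stay in top {ela_stay:.0%}, bottom to top {ela_flip:.0%}")
print(f"Math: stay in top {math_stay:.0%}, bottom to top {math_flip:.0%}")
```

The simulated shares will not match the slide's percentages exactly, since the category definitions differ. The point is only that correlations of this size leave most “good” teachers outside the “good” group a year later.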


But is the signal we find real or false?

• Math – Likelihood of being labeled “good”

– 15% less likely to be good in school with higher attendance rate

– 7.3% less likely to be good for each 1 student increase in school average class size

– 6.5% more likely to be good for each additional 1% proficient in Math

• Math – Likelihood of being repeatedly labeled “good”

– 19% less likely to be sequentially good in school with higher attendance rate (gr 4 to 8)

– 6% less likely to be sequentially good in school with 1 additional student per class (gr 4 to 8)

– 7.9% more likely to be sequentially good in school with 1% higher math proficiency rate. 

• Math Flip Side – Likelihood of being labeled “bad”

– 14% more likely to be bad in school with higher attendance rate. 

– 7.9% more likely to be sequentially bad for each additional student in average class size (gr 4 to 8)


How many NYC teachers are Irreplaceable?


Figure 1 – Who is irreplaceable in 2006‐07 after being irreplaceable in 2005‐06?

[Figure: scatterplot of teacher %ile 2006-07 (vertical axis, 0 to 100) against %ile 2005-06 (horizontal axis, 0 to 100). Legend: OK to Stinky Teachers; Awesome Teachers. Regions labeled Awesomeness and Awesome x 2.]

Important Tangent: Note how spreading data into percentiles makes pattern messier!


Figure 2 – Among those 2005‐06 Irreplaceables, how do they reshuffle between 2006‐07 & 2007‐08?

[Figure: scatterplot of teacher %ile 2007-08 (vertical axis, 0 to 100) against %ile 2006-07 (horizontal axis, 0 to 100). Legend: OK to Stinky Teachers; Awesome Teachers. Region labeled Awesome x 3.]


Figure 3 – How many of those teachers who were totally awesome in 2007‐08 were still totally awesome in 2008‐09?

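[Scatterplot: teacher percentile rank in 2008-09 (y-axis, 0 to 100) against percentile rank in 2007-08 (x-axis, 0 to 100). Regions labeled “OK to Stinky Teachers” and “Awesome Teachers”; annotation “Awesome x 4? [but may have dropped out one prior year]”.]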


Figure 4 – How many of those teachers who were totally awesome in 2008‐09 were still totally awesome in 2009‐10?

[Scatterplot: teacher percentile rank in 2009-10 (y-axis, 0 to 100) against percentile rank in 2008-09 (x-axis, 0 to 100). Regions labeled “OK to Stinky Teachers” and “Awesome Teachers”.]


Persistently Irreplaceable?

Of the thousands of teachers for whom ratings exist for each year, there are 14 in math and 5 in ELA that stay in the top 20% for each year! 

Sure hope they don’t leave!
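As a rough yardstick for those counts, here is a minimal sketch (not part of the original analysis) of how many teachers would land in the top 20% every single year if ratings were pure noise, and how that count changes when ratings carry a modest stable component. The five-year window is taken from the 2005‐06 through 2009‐10 figures; the teacher count of 5,000 and the year-to-year correlation of 0.3 are illustrative assumptions, not values from the NYC data.

```python
import random

def expected_by_chance(n_teachers, years, top_share=0.2):
    """Expected count in the top share every year if ratings are independent noise."""
    return n_teachers * top_share ** years

def simulate_persistent(n_teachers, years, year_corr, top_share=0.2, seed=0):
    """Count teachers in the top share in every year under a hypothetical model:
    each year's rating = stable teacher component + fresh independent noise."""
    rng = random.Random(seed)
    # Variances year_corr (stable) and 1 - year_corr (noise) make the
    # correlation between any two years' ratings equal to year_corr.
    stable_sd = year_corr ** 0.5
    noise_sd = (1 - year_corr) ** 0.5
    stable = [rng.gauss(0, stable_sd) for _ in range(n_teachers)]
    n_top = int(n_teachers * top_share)
    always_top = set(range(n_teachers))
    for _ in range(years):
        scores = [s + rng.gauss(0, noise_sd) for s in stable]
        ranked = sorted(range(n_teachers), key=lambda i: scores[i], reverse=True)
        always_top &= set(ranked[:n_top])
    return len(always_top)

N_TEACHERS = 5000  # assumption: stands in for "thousands of teachers"
YEARS = 5          # assumption: 2005-06 through 2009-10

print(expected_by_chance(N_TEACHERS, YEARS))        # pure-noise expectation
print(simulate_persistent(N_TEACHERS, YEARS, 0.3))  # assumed correlation of 0.3
```

If the simulated count under a plausible correlation is in the same range as the observed handful, the “persistently irreplaceable” group is about what a noisy measure would produce anyway.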


The SGP Difference!

New Jersey’s Violation of the Most Basic Rules of Attribution


SGPs & New Jersey

• Student Growth Percentiles are not designed for inferring teacher influence on student outcomes.

• Student Growth Percentiles do not (even try to) control for various factors outside of the teacher’s control.

• Student Growth Percentiles are not backed by research on estimating teacher effectiveness. By contrast, research on SGPs has shown them to be poor at isolating teacher influence.

• New Jersey’s Student Growth Percentile measures, at the school level, are significantly statistically biased with respect to student population characteristics and average performance level.


In the author’s words…

Damian Betebenner (creator of the Colorado Growth Model):

“Unfortunately Professor Baker conflates the data (i.e. the measure) with the use. A primary purpose in the development of the Colorado Growth Model (Student Growth Percentiles/SGPs) was to distinguish the measure from the use: To separate the description of student progress (the SGP) from the attribution of responsibility for that progress.”

http://www.ednewscolorado.org/voices/student‐growth‐percentiles‐and‐shoe‐leather


Playing Semantics… 

• When pressed on the point that SGPs are not designed for attributing student gains to their teachers, those defending their use in teacher evaluation will often say…

– “SGPs are a good measure of student growth, and shouldn’t teachers be accountable for student growth?”

• Let’s be clear here, one cannot be accountable for something that is not rightly attributable to them!


New Jersey’s MGPs

Distilling Signal from Noise

But the Signal is False!


New Jersey SGPs & Performance Level

[Scatterplot, schools including 7th grade: median SGP in math (y-axis) against % proficient in 7th grade math (x-axis, 0 to 100); correlation = .54]

Is it really true that the most effective teachers are in the schools that already have high proficiency rates? 

Strong FALSE signal (bias)


New Jersey SGPs & Performance Level

[Scatterplot, schools including 7th grade: median SGP in language arts (y-axis) against % proficient in 7th grade language arts (x-axis, 0 to 100); correlation = .57]

Is it really true that the most effective teachers are in the schools that already have high proficiency rates? 

Strong FALSE signal (bias)
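The correlations reported on these slides (.54 and .57) are the kind of check anyone with school-level data can rerun. A minimal sketch of that check follows; the seven school rows are invented for illustration and are not New Jersey data, and the function name is hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical school-level rows: (% proficient, median SGP). Not real data.
schools = [(35, 38), (48, 45), (55, 41), (62, 52), (70, 49), (81, 58), (90, 55)]
pct_proficient = [row[0] for row in schools]
median_sgp = [row[1] for row in schools]

r = pearson_r(pct_proficient, median_sgp)
print(round(r, 2))
```

A growth measure that isolated school or teacher contribution from who attends the school should show a correlation near zero here, unless effective teachers really are concentrated in already high-performing schools.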


Middle School MGP Racial Bias

[Scatterplot, grades 06-08 schools: language arts MGP (y-axis) against share of students who are Black or Hispanic (x-axis, 0 to 1); correlation = -.4755]

Is it really true that the most effective teachers are in the schools that serve the fewest minority students?

Strong FALSE signal (bias)


Middle School MGP Racial Bias

[Scatterplot, grades 06-08 schools: math MGP (y-axis) against share of students who are Black or Hispanic (x-axis, 0 to 1); correlation = -.3260]

Is it really true that the most effective teachers are in the schools that serve the fewest minority students?

Strong FALSE signal (bias)


Okay… so is it really true that the most effective teachers are in the schools that serve the fewest non‐proficient special education students?

Significant FALSE signal (bias)


External Review of NY State MGPs [which apply ex‐post corrections]

But the study found that New York did not adequately weigh factors like poverty when measuring students’ progress.

“We find it more common for teachers of higher‐achieving students to be classified as ‘Effective’ than other teachers,” the study said. “Similarly, teachers with a greater number of students in poverty tend to be classified as ‘Ineffective’ or ‘Developing’ more frequently than other teachers.”

Andrew Rice, a researcher who worked on the study, said New York was dealing with common challenges that arise when trying to measure teacher impact amid political pressures.

“We have seen other states do lower‐quality work,” he said.

http://www.lohud.com/article/20131015/NEWS/310150042/Study‐faults‐NY‐s‐teacher‐evaluations


Modern Teacher Evaluation Policies

Making Certain Distinctions with Uncertain Information

• First, the modern teacher evaluation template requires that objective measures of student achievement growth necessarily be considered in a weighting system of parallel components.

– Placing the measures alongside one another in a weighting scheme assumes all measures in the scheme to be of equal validity and reliability but of varied importance (utility) – varied weight.

• Second, the modern teacher evaluation template requires that teachers be placed into effectiveness categories by assigning arbitrary numerical cutoffs to the aggregated weighted evaluation components.

– That is, a teacher in the 25%ile or lower when combining all evaluation components might be assigned a rating of “ineffective,” whereas the teacher at the 26%ile might be labeled effective.

• Third, the modern teacher evaluation template places inflexible timelines on the conditions for removal of tenure.

– Typical legislation dictates that teacher tenure either can or must be revoked and the teacher dismissed after 2 consecutive years of being rated ineffective (where tenure can only be achieved after 3 consecutive years of being rated effective).
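To make the cutoff problem concrete, here is a toy simulation of a hard cut at the 25th percentile applied to a noisy composite score. The model (observed score = true standing + independent noise), the reliability of 0.4, and the function name are all assumptions chosen for illustration; none of them come from any state's evaluation system.

```python
import random

def share_wrongly_labeled(n_teachers=10000, reliability=0.4, cut_share=0.25, seed=1):
    """Among teachers labeled 'ineffective' by a hard cut on the observed score,
    the share whose true standing is actually above the cut.
    Hypothetical model: reliability = var(true) / var(observed)."""
    rng = random.Random(seed)
    true_sd = reliability ** 0.5
    noise_sd = (1 - reliability) ** 0.5
    true = [rng.gauss(0, true_sd) for _ in range(n_teachers)]
    observed = [t + rng.gauss(0, noise_sd) for t in true]
    n_cut = int(n_teachers * cut_share)
    # Bottom group by true standing versus bottom group by observed score.
    true_bottom = set(sorted(range(n_teachers), key=lambda i: true[i])[:n_cut])
    labeled_bottom = set(sorted(range(n_teachers), key=lambda i: observed[i])[:n_cut])
    return len(labeled_bottom - true_bottom) / n_cut

for rel in (0.9, 0.4, 0.2):  # illustrative reliabilities
    print(rel, round(share_wrongly_labeled(reliability=rel), 2))
```

The lower the reliability, the larger the share of “ineffective” labels that land on teachers who do not belong below the line, which is the sense in which a strict cut point makes a certain distinction out of uncertain information.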


Due Process Concerns

• Attribution/Validity

– Face

• There are many practical challenges including whether teachers should be held responsible for summer learning in an annual assessment model, or how to parse influence of teacher teams and/or teachers with assistants. 

• SGPs have their own face validity problem, in the author’s own words. One cannot reasonably evaluate someone on a measure not attributable to them.

– Statistical

• There may be, and with an SGP there likely will be, significant persistent bias (that is, other stuff affecting growth such as student assignment, classroom conditions, class sizes, etc.), which renders the resulting estimate NOT attributable to the teacher.

• Reliability

– Lack of reliability of measures, jumping around from year to year, suggests also that the measures are not a valid representation of actual teacher quality. 

• Arbitrary/Unjustifiable Decisions

– Cut‐points imposed throughout the system make invalid assumptions regarding the statistics – that a 1 point differential is meaningful. 


Negotiating Remedies?

• Student stratified random assignment clauses

• Class size restrictions (uniformity) 

• Time of day/schedule cycling 

• Facilities, materials, supplies and equipment uniformity clauses 

– Room lighting, space, temperature


• At best, this applies to 20 to 30% of teachers

– At the middle school level! (which would be the highest)

[Chart: Average NJ Middle School, 2008 – 2011 Fall Staffing Reports. Staff categories: Art; Business; Elementary Generalist; English/LAL; Family/Consumer Sci; Health/PE; Industrial Arts; Math; Music; Science; Social Studies; Special Ed; Support Services; Vocational Ed; World Language]

1. Requires differential contracts by staffing type

2. Some states/school districts “resolve” this problem by assigning all other teachers the average of those rated:

   1. Significant “attribution” concern (due process)

   2. Induces absurd practices

3. This problem undermines “reform” arguments that in cases of RIF, quality, not seniority should prevail because supposed “quality” measures only apply to those positions least likely to be reduced. 


Egregious Misrepresentations

• NJ Commissioner Christopher Cerf explained:

– “You are looking at the progress students make and that fully takes into account socio‐economic status,” Cerf said. “By focusing on the starting point, it equalizes for things like special education and poverty and so on.”[17] (emphasis added)

• Why this statement is untrue:

– First, comparisons of individual students don’t actually explain what happens when a group of students is aggregated to their teacher and the teacher is assigned the median student’s growth score to represent his/her effectiveness, where teachers don’t all have an evenly distributed mix of kids who started at similar points (to other teachers). So, in one sense, this statement doesn’t even address the issue.

– Second, this statement is simply factually incorrect, even regarding the individual student. The statement is not supported by research on estimating teacher effects which largely finds that sufficiently precise student, classroom and school level factors do relate to variations not only in initial performance level but also in performance gains. 
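Both points can be illustrated with a toy simulation. It is a sketch under loudly stated assumptions, not New Jersey’s SGP model: real SGPs are computed with quantile regression, whereas this version simply ranks students against peers in the same prior-score bin. Every teacher here has an identical (zero) effect; the two hypothetical groups of classes differ only in their share of students in poverty, and poverty is assumed to lower gains by 0.3 standard deviations.

```python
import random
from statistics import median

def simulate_mgp_gap(n_students=20000, poverty_gain_penalty=0.3, seed=2):
    """Median growth percentile for two hypothetical groups of classes whose
    teachers have IDENTICAL (zero) effects but different student mixes."""
    rng = random.Random(seed)
    students = []
    for _ in range(n_students):
        group = rng.choice(["high_poverty_classes", "low_poverty_classes"])
        poor = rng.random() < (0.8 if group == "high_poverty_classes" else 0.2)
        prior = rng.gauss(-0.5 if poor else 0.5, 1.0)
        # Assumption: poverty lowers the gain as well as the starting point.
        gain = rng.gauss(-poverty_gain_penalty if poor else 0.0, 1.0)
        students.append((group, prior, prior + gain))
    # Growth percentile: rank of the current score among students with similar
    # prior scores (20 prior-score bins stand in for quantile regression).
    students.sort(key=lambda s: s[1])
    bin_size = n_students // 20
    sgps = {"high_poverty_classes": [], "low_poverty_classes": []}
    for start in range(0, n_students, bin_size):
        peers = students[start:start + bin_size]
        order = sorted(range(len(peers)), key=lambda i: peers[i][2])
        for rank, i in enumerate(order):
            sgps[peers[i][0]].append(100 * (rank + 0.5) / len(peers))
    return {group: median(values) for group, values in sgps.items()}

print(simulate_mgp_gap())
```

Under these assumptions the high-poverty classes receive a lower median growth percentile even though no teacher differs from any other, because conditioning on the starting point does not remove factors that affect gains.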

[17]http://www.wnyc.org/articles/new‐jersey‐news/2013/mar/18/everything‐you‐need‐know‐about‐students‐baked‐their‐test‐scores‐new‐jersy‐education‐officials‐say/


Further research on this point…

• Two recent working papers compare SGP and VAM estimates for teacher and school evaluation and both raise concerns about the face validity and statistical properties of SGPs.

– Goldhaber and Walch (2012) conclude: “For the purpose of starting conversations about student achievement, SGPs might be a useful tool, but one might wish to use a different methodology for rewarding teacher performance or making high‐stakes teacher selection decisions” (p. 30).[6]

– Ehlert and colleagues (2012) note: “Although SGPs are currently employed for this purpose by several states, we argue that they (a) cannot be used for causal inference (nor were they designed to be used as such) and (b) are the least successful of the three models [Student Growth Percentiles, One‐Step VAM & Two‐Step VAM] in leveling the playing field across schools” (p. 23).[7]

[6] Goldhaber, D., & Walch, J. (2012). Does the model matter? Exploring the relationship between different student achievement‐based teacher assessments. University of Washington at Bothell, Center for Education Data & Research. CEDR Working Paper 2012‐6.

[7] Ehlert, M., Koedel, C., Parsons, E., & Podgursky, M. (2012). Selecting growth measures for school and teacher evaluations. National Center for Analysis of Longitudinal Data in Education Research (CALDER). Working Paper #80.


Egregious Misrepresentations

• “The Christie administration cites its own research to back up its plans, the most favored being the recent Measures of Effective Teaching (MET) project funded by the Gates Foundation, which tracked 3,000 teachers over three years and found that student achievement measures in general are a critical component in determining a teacher’s effectiveness.”[23]

• The Gates Foundation MET project did not study the use of Student Growth Percentile Models. Rather, the Gates Foundation MET project studied the use of value‐added models, applying those models under the direction of leading researchers in the field, testing their effects on fall to spring gains, and on alternative forms of assessments. Even with these more thoroughly vetted value‐added models, the Gates MET project uncovered, though largely ignored, numerous serious concerns regarding the use of value‐added metrics. External reviewers of the Gates MET project reports pointed out that while the MET researchers maintained their support for the method, the actual findings of their report cast serious doubt on its usefulness.[24]

• [23]http://www.njspotlight.com/stories/13/03/18/fine‐print‐overview‐of‐measures‐for‐tracking‐growth/

• [24] Rothstein, J. (2011). Review of “Learning About Teaching: Initial Findings from the Measures of Effective Teaching Project.” Boulder, CO: National Education Policy Center. Retrieved [date] from http://nepc.colorado.edu/thinktank/review‐learning‐about‐teaching. [accessed 2‐may‐13]


Egregious Misrepresentations

• But… even though these ratings look unstable from year to year, they are about as stable as baseball batting averages from year to year, and clearly batting average is a “good” statistic for making baseball decisions?

• Not so, say the baseball stat geeks:

– Not surprisingly, Batting Average comes in at about the same consistency for hitters as ERA for pitchers. One reason why BA is so inconsistent is that it is highly correlated to Batting Average on Balls in Play (BABIP)–.79–and BABIP only has a year‐to‐year correlation of .35.

– Descriptive statistics like OBP and SLG fare much better, both coming in at .62 and .63 respectively. When many argue that OBP is a better statistic than BA it is for a number of reasons, but one is that it’s more reliable in terms of identifying a hitter’s true skill since it correlates more year‐to‐year.

http://www.beyondtheboxscore.com/2011/9/1/2393318/what‐hitting‐metrics‐are‐consistent‐year‐to‐year

Put simply, VAM estimates ARE about as useful as batting average – NOT VERY!
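One way to read the year-to-year correlations quoted above: squaring a correlation gives the share of one year's variance that is linearly predictable from the prior year. The short sketch below does only that arithmetic on the three quoted figures; the dictionary and function names are illustrative.

```python
# Year-to-year correlations as quoted in the Beyond the Box Score excerpt.
year_to_year_r = {"BABIP": 0.35, "OBP": 0.62, "SLG": 0.63}

def shared_variance(r):
    """Squared correlation: share of variance in one year that is
    linearly predictable from the other year."""
    return r ** 2

for stat, r in year_to_year_r.items():
    print(f"{stat}: r = {r:.2f}, r^2 = {shared_variance(r):.2f}")
```

Even the “much better” statistics leave most of the next year's variation unexplained by the current year, and a statistic at the .35 level leaves nearly all of it unexplained.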


And one more…

• While the Chetty, Friedman and Rockoff studies suggest that variation across classrooms of kids in NYC many, many years ago, absent high stakes assessment, is associated with small wage differences for those kids at age 30 (thus arguing that teaching quality, as measured by variation in classroom level student gains, matters), these studies have no direct implications for what might work in hiring, retaining and compensating teachers.

• The presence of variation across thousands of teachers, even if loosely correlated with other stuff, provides little basis for identifying any one single teacher as persistently good or bad. 


Clearing up a few additional myths

• Teaching is plainly and obviously the only profession anywhere where you get more pay just for collecting credits and getting older.

• In the “real world” (read “private sector” employment) everyone is consistently and objectively evaluated on their performance, and compensated and/or dismissed in accordance with those evaluations.

• And since the private sector is necessarily consistently more innovative, productive and efficient, we know this stuff works. Therefore it will – it must ‐ work in schools!


Strict Metrics, Eval‐in‐a‐Can & Employee Deselection at Microsoft

“Major restructuring at any company is almost always traumatic, but Microsoft’s ultra‐competitive corporate culture will amplify the impact.

Last year a Vanity Fair magazine story described Microsoft’s debilitating employee ranking system, in which team leaders are forced to hand out reviews based on a quota system. So at least one member of each group will get a bad review, no matter how well they perform.

That system has fostered a lack of cooperation and vicious office politics, a malady that is said to run through the entire company at all levels.”

http://www.marketoracle.co.uk/Article41350.html


On the Effectiveness of “Portfolio” Models in the Private Sector? [the story of Sears]

Although Lampert is notoriously media-averse, he agreed to answer questions about Sears’s organizational model via e-mail. “Decentralized systems and structures work better than centralized ones because they produce better information over time,” Lampert writes. “The downside is that, to some, it appears messier than centralized systems.” Lampert adds that the structure enables him to evaluate the individual parts of Sears, so he can collect “significantly better information and drive decision-making and accountability at a more appropriate level.”

Lampert created the model because he wanted deeper data, which he could use to analyze the company’s assets. It’s why he hired Paul DePodesta, the Harvard-educated statistician immortalized by Michael Lewis in his book Moneyball: The Art of Winning an Unfair Game, to join Sears’s board. He wanted to use nontraditional metrics to gain an edge, like DePodesta did for the Oakland Athletics in Moneyball and is trying to repeat in his current job with the New York Mets. Only so far, Lampert’s experiment resembles a different book: The Hunger Games.

http://www.businessweek.com/articles/2013‐07‐11/at‐sears‐eddie‐lamperts‐warring‐divisions‐model‐adds‐to‐the‐troubles


The Bottom Line? 

• Policymakers seem to be moving forward on implementation of policies that display baffling ignorance of basic statistical principles: that one simply cannot draw precise conclusions (and thus make definitive decisions) based on imprecise information.

• Can’t draw a strict cut point through messy data. Same applies to high stakes cut scores for kids. 

– That one cannot make assertions about the accuracy of the position of any one point among thousands, based on the loose patterns we find in these types of data. 

• Good data informed decision making requires deep nuanced understanding of statistics, measures, what they mean… and most importantly WHAT THEY DON’T! (and can’t possibly)


Reasonable Alternatives?

• To the extent these data can produce some true signal amidst the false signal and noise, central office data teams in large districts might be able to use several (not just one) rich, but varied models to screen for variations that warrant further exploration.

• This screening approach, much like high‐error‐rate rapid diagnostic tests, might tell us where to focus some additional energy (that is, classroom and/or school observation).

• We may then find that the signal was false, or that it really does tell us something either about how we’ve mismatched teachers and assignments, or the preparedness of some teachers. 

• But, the initial screening information should NEVER dictate the final decision (as it will under Toxic Trifecta models). 

• But, if we find that the data‐driven analysis more often sends us down inefficient pathways, we might decide it’s just not worth it. 
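The diagnostic-test analogy can be worked through with Bayes’ rule. The sketch below computes, for a screen with given error rates, what share of flagged cases are real. The prevalence, sensitivity and specificity values are hypothetical inputs chosen for illustration, not estimates from any district’s data.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Share of flagged cases that are true cases, by Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Hypothetical inputs: 5% of classrooms have a real problem, the screen
# catches 70% of them and wrongly flags 20% of the rest.
ppv = positive_predictive_value(prevalence=0.05, sensitivity=0.7, specificity=0.8)
print(round(ppv, 3))
```

With inputs like these, most flags are false alarms, which is why a flag can reasonably prompt an observation but cannot reasonably stand as the final decision.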

But this cannot be achieved by centralized policy or through contractual agreements. 

Unfortunately current policies and recent contractual agreements prohibit thoughtful, efficient strategies! 

Screening

Observation

Validation [or NOT]

& Questions?