Issues relating to Large-scale Assessments

1

Issues relating to Large-scale Assessments

Margaret Wu, Victoria University

2

International large-scale assessments

Main problem: interpretations of the results

Focus on country rankings

An example:

In August 2012, Julia Gillard, then Prime Minister of Australia, declared that Australia would strive to be ranked in the ‘top five’ in international education assessments by 2025.

So strong is this ambition that it has been inscribed into the Australian Education Act of 2013 as its very first objective, which reads: ‘Australia to be placed, by 2025, in the top 5 highest performing countries based on the performance of school students in reading, mathematics and science’ (Australian Education Act, 2013, p. 3)

3

Does a high ranking mean a good education system?

A Vietnamese researcher queried why Vietnam did well in PISA despite the poor education system in Vietnam.

Gorur & Wu, 2014 (former OECD official, interview transcript):

What’s the good of [the rankings]? What is the benefit to the US to be told that it is number seven or number 10? It’s useless, meaningless, except for a media beat up and political huffing and puffing. It’s very important for the US to know, having defined certain goals like improving participation rates for impoverished students from suburbs in large cities – whether in fact that is happening, and if it is, why it is happening and if not, why not. And it is irrelevant whether Chile or Russia or France is doing better or worse – that doesn’t help one bit – in fact it probably hinders. Makes people feel uncertain, unsure, nervous, and they rush over there and find out why they are doing better.

4

And rushed there, they did…

(Australian) Grattan Institute’s Catching Up: Learning from the Best School Systems in East Asia (Jensen et al., 2012)

… researchers from Grattan Institute visited the four education systems [Hong Kong, Shanghai, Korea and Singapore] studied in this report. They met educators, government officials, school principals, teachers and researchers. They collected extensive documentation at central, District and school levels. Grattan Institute has used this field research and the lessons taken from the Roundtable to write this report (p. 6)

5

Suggested factors for high ranking (performance)

One observation made by the Grattan Institute…

“Shanghai, for example, has larger class sizes to give teachers more time for school-based research to improve learning and teaching.” (p. 2)

(Observation also made by OECD PISA, 2010)

New Zealand government proposed to increase class size to free up money to fund initiatives to raise the quality of teaching (NZ Treasury briefing paper, March 2012)

6

Discussion points

These “policies” are often said to be “evidence-based”, where large-scale assessments are frequently quoted as the sources of evidence.

Why should we be concerned with these policies?

Consider

Validity issues

Reliability issues

7

Validity issues

Linking factors to performance

Korea and China perform well, and have large class sizes.

Can we conclude that large class size leads to good performance?

Making inferences:

No. of storks positively correlated with no. of babies born

Crime rate positively correlated with ice cream sales

People who take care of their teeth have better general health

Mediating variables at play
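The ice cream example can be illustrated with a small simulation. The sketch below is only an illustration with invented data: it assumes a third variable (called `temperature` here, a name not in the slides) that drives both ice cream sales and crime, and shows that the two outcomes correlate even though neither affects the other. Once the third variable is removed from both, the correlation largely disappears.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


# Invented data: temperature influences both outcomes; they do not influence each other.
rng = random.Random(0)
temperature = [rng.gauss(0, 1) for _ in range(2000)]
ice_cream = [t + rng.gauss(0, 1) for t in temperature]
crime = [t + rng.gauss(0, 1) for t in temperature]

raw_r = pearson(ice_cream, crime)
adjusted_r = pearson(residuals(ice_cream, temperature),
                     residuals(crime, temperature))

print(f"correlation of ice cream sales with crime: {raw_r:.2f}")
print(f"same, after removing temperature from both: {adjusted_r:.2f}")
```

In a survey such as PISA the third variable is often not measured at all, so this kind of adjustment may not be available.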

8

Linking PISA to Policies

PISA tells us about student performance, and background of students/schools/countries

Linking background to performance is done by people, not proven by statistics.

Any interpretation is an inference.

PISA cannot substantiate the validity of the inferences.

Need other in-depth studies.

9

A common misunderstanding about statistical analysis

Regression equation: Y = a + bX

X is termed the explanatory variable

Y is termed the dependent variable

Does X explain Y?

Try X = a + bY

Exactly the same standardized coefficient, t statistic and significance level

Regression does not test for causal inference.

Regression only reflects correlation.
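The symmetry can be checked directly. The following is a minimal sketch with made-up data (the `x` and `y` values and the helper `simple_ols` are illustrative, not from the slides): it fits a simple regression in both directions and compares the t statistic of the slope.

```python
from math import sqrt


def simple_ols(xs, ys):
    """Fit ys = a + b*xs by least squares.

    Returns (a, b, standard error of b, t statistic of b).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se_b = sqrt(sse / (n - 2) / sxx)
    return a, b, se_b, b / se_b


# Made-up data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 3.2, 4.8, 4.9, 6.3]

_, b_yx, _, t_yx = simple_ols(x, y)  # Y regressed on X
_, b_xy, _, t_xy = simple_ols(y, x)  # X regressed on Y

print(f"Y on X: slope = {b_yx:.3f}, t = {t_yx:.3f}")
print(f"X on Y: slope = {b_xy:.3f}, t = {t_xy:.3f}")
```

The unstandardized slopes differ, but the t statistic (and therefore the significance level) is the same whichever variable is labelled "explanatory". The fit itself carries no information about direction.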

10

Regress Reading on GDP scores

Coefficients (a)

Model 1       Unstandardized B   Std. Error   Standardized Beta   t        Sig.
(Constant)    479.828            10.310                           46.542   .000
GDP           .427               .301         .243                1.416    .167

a. Dependent Variable: Reading

Coefficients (a)

Model 1       Unstandardized B   Std. Error   Standardized Beta   t        Sig.
(Constant)    -36.464            48.202                           -.756    .455
Reading       .138               .098         .243                1.416    .167

a. Dependent Variable: GDP
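One way to see that the two outputs describe the same correlation: in simple regression the product of the two unstandardized slopes equals the squared correlation. The short check below uses only the figures reported in the two tables (the variable names are illustrative).

```python
from math import sqrt

# Figures as reported in the two coefficient tables.
b_reading_on_gdp = 0.427  # slope when Reading is the dependent variable
b_gdp_on_reading = 0.138  # slope when GDP is the dependent variable
reported_beta = 0.243     # standardized coefficient, identical in both tables

# In simple regression: slope(Y on X) * slope(X on Y) = r squared.
implied_r = sqrt(b_reading_on_gdp * b_gdp_on_reading)

print(f"correlation implied by the two slopes: {implied_r:.3f}")
print(f"standardized coefficient reported:     {reported_beta:.3f}")
```

The two values agree to within the rounding of the reported slopes.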

11

Reliability Issues

How strong is the relationship between two variables?

P value = 0.11 (n.s. at 95% level)

Number of countries that performed higher than the OECD average in reading / total number of countries:

                                            Small class size and/or    Large class size and       Total
                                            low teachers’ salaries     high teachers’ salaries
Low cumulative expenditure on education     3 out of 31                3 out of 12                6/43
High cumulative expenditure on education    8 out of 20                2 out of 2                 10/22
Total                                       11/51                      5/14                       16/65
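The margins of this table can be recomputed from the four cells, which also makes clear how few countries sit in some cells (one cell holds only two countries). The sketch below does that and then, purely as an illustration, runs a two-sided Fisher exact test on the class-size margins. The slides do not say which test produced the reported p-value, so the choice of test and the helper `fisher_exact_two_sided` are assumptions, and the result is not an attempt to reproduce that figure.

```python
from math import comb

# (countries above the OECD reading average, total countries) per cell, from the slide.
cells = {
    ("low expenditure", "small class / low salary"): (3, 31),
    ("low expenditure", "large class / high salary"): (3, 12),
    ("high expenditure", "small class / low salary"): (8, 20),
    ("high expenditure", "large class / high salary"): (2, 2),
}


def margin(index, label):
    """Sum the cells whose row (index 0) or column (index 1) matches label."""
    above = sum(a for key, (a, n) in cells.items() if key[index] == label)
    total = sum(n for key, (a, n) in cells.items() if key[index] == label)
    return above, total


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(k):
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= p_obs * (1 + 1e-9))


small = margin(1, "small class / low salary")   # (11, 51)
large = margin(1, "large class / high salary")  # (5, 14)

p_value = fisher_exact_two_sided(small[0], small[1] - small[0],
                                 large[0], large[1] - large[0])

print("small class / low salary:", small)
print("large class / high salary:", large)
print(f"Fisher exact p-value (illustrative only): {p_value:.3f}")
```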

12

Top five in what?

Interview transcript of a senior OECD official (Gorur & Wu):

OECD Official: Well, Australia is doing pretty well!

RG: It’s doing well, right? But you know what we want to do now? Our Prime Minister says we want to be in the top five in PISA!

OECD Official: Top five in what?

RG: In PISA.

OECD Official: Yes, but for which students? The average student in Canada, in Korea, Finland, Shanghai, China – that’s one thing. If you then look at high performing students or how low performing students do, then we may get a completely different picture. And that’s where policy efforts are most interesting for me.

13

Australian 2009 PISA Reading results, by state

State       Mean score   Confidence interval
ACT         531          520–543
WA          522          510–534
QLD         519          505–532
NSW         516          505–527
VIC         513          504–523
SA          506          497–516
TAS         483          472–495
NT          481          469–492
Australia   515          510–519

Slide annotations: “In top 5 already”; “Below OECD average”

14

Ranking by item content

Item M408Q01TR                 Item M420Q01TR
Country            Score       Country            Score
Hong Kong-China    0.60        New Zealand        0.66
Finland            0.56        Australia          0.64
Australia          0.56        Canada             0.64
Chinese Taipei     0.55        Ireland            0.62
United Kingdom     0.55        Shanghai-China     0.62
New Zealand        0.55        United Kingdom     0.60
Macao-China        0.53        United States      0.59
Iceland            0.52        Chinese Taipei     0.58
Ireland            0.51        Singapore          0.57
Singapore          0.50        Denmark            0.57

15

Differential Item Functioning (DIF)

Australia performed extremely well on Items M408Q01TR and M420Q01TR, ranking third and second respectively internationally. For Item M408Q01TR, Shanghai-China ranked 20th, despite the fact that Shanghai took the top spot internationally in mathematics literacy, with a mean score much higher than the second place country, Singapore. For Item M420Q01TR, Australia outperformed all top ranking countries.

In contrast, for Item M462Q01DR, Australia ranked 43rd internationally, with an average score of only 0.1 out of a maximum of two, while Shanghai had an average score of 1.5 out of a maximum of two.

16

Implications of DIF

Average score (and ranking) hides DIF.

Existence of DIF threatens comparisons across countries, as the achievement results depend on which items are in the test.

17

An Example - Japan

PISA reading

2000: 522

2003: 498

A 24 point drop, about 6 months of growth!

Triggered huge reactions in Japan

Blame on reform started two years before

New reforms and policies

18

How PISA trends are established

Select some items from 2000 as “anchoring items”

Place in 2003 test

So 2003 results can be placed on the 2000 scale
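The idea of anchoring can be sketched as a simple shift of scale. The example below is a deliberately simplified illustration and not PISA's actual linking procedure: it assumes item difficulties on a logit scale, estimated separately in each cycle, and uses the average difference on the anchoring items to move 2003 estimates onto the 2000 scale. All item labels and numbers are hypothetical.

```python
# Hypothetical difficulties (logits) of the anchoring items in each calibration.
anchor_2000 = {"A1": -0.50, "A2": 0.10, "A3": 0.80, "A4": 1.20}
anchor_2003 = {"A1": -0.70, "A2": -0.10, "A3": 0.60, "A4": 1.00}


def linking_shift(old, new):
    """Average difficulty difference (old minus new) over the common items."""
    common = sorted(set(old) & set(new))
    return sum(old[item] - new[item] for item in common) / len(common)


shift = linking_shift(anchor_2000, anchor_2003)

# Hypothetical 2003 ability estimates, moved onto the 2000 scale.
abilities_2003 = [-0.3, 0.4, 1.1]
on_2000_scale = [theta + shift for theta in abilities_2003]

print(f"linking shift: {shift:.2f} logits")
print("2003 abilities on the 2000 scale:", [round(t, 2) for t in on_2000_scale])
```

Because the shift is an average over the anchoring items, it depends on which items are chosen as anchors.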

19

Item Bias

Items don’t work in the same way in all countries.

One item may be relatively more difficult for one country than for other countries.

Differential Item Functioning (DIF)

20

Differential Item Functioning

Hypothetical example:

Item   Country A (% correct)   Country B (% correct)
1      65                      76
2      74                      83
3      42                      51
4      79                      85
5      73                      64                      Biased against B / Favours A
6      72                      91                      Biased against A / Favours B
7      46                      54
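The hypothetical table can be screened mechanically. The sketch below compares each item's gap between the two countries with the typical (median) gap and flags items that depart from it by more than a cut-off. The cut-off of 8 percentage points is an arbitrary value chosen for this illustration, and operational DIF analyses normally work on a logit scale rather than on raw percentages.

```python
from statistics import median

# Percent correct per item: (Country A, Country B), from the hypothetical example.
percent_correct = {
    1: (65, 76),
    2: (74, 83),
    3: (42, 51),
    4: (79, 85),
    5: (73, 64),
    6: (72, 91),
    7: (46, 54),
}

THRESHOLD = 8  # percentage points; arbitrary cut-off for this illustration

# Gap in favour of Country B on each item, and the typical gap across items.
gaps = {item: b - a for item, (a, b) in percent_correct.items()}
typical_gap = median(gaps.values())

flags = {}
for item, gap in gaps.items():
    deviation = gap - typical_gap
    if deviation >= THRESHOLD:
        flags[item] = "favours B"
    elif deviation <= -THRESHOLD:
        flags[item] = "favours A"

print("typical gap (B minus A):", typical_gap)
print("flagged items:", flags)
```

Country B is ahead by about nine points on most items, so an item is flagged only when its gap departs clearly from that overall difference, not merely because one country scores higher.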

21

Japan vs International Item Parameters

[Figure: Comparison of international item parameters and national parameters for Japan. Horizontal axis: international difficulty (logit scale). Vertical axis: difficulty for Japan (logit scale).]

22

Anchoring items in Reading 2003

Many anchoring items were biased against Japan

Japan’s mean score would increase by 10 score points if one particular reading unit was removed from the set of eight anchoring units (Monseur & Berezner, 2007).

23

Fluctuation of Country Results

Owing to items selected for a test for reasons such as

Cultural differences

Language differences

Curriculum differences

24

2000 – 2009 trends

It has often been claimed that Australia is slipping in Reading.

25

-13 points

26

What PISA tells us

Big picture

Australia is doing pretty well

Australia and New Zealand lead the English speaking countries

(Confucius culture) Asian countries lead in academic performance

Finland does very well among non-Asian countries

May suggest something for further investigation

27

Limitations of large-scale assessments

Not able to collect data on all factors related to education.

For example, private spending on education has not been captured

Students’ lives outside schools.

Look beyond international ranks

Focus on within country comparisons

Don’t jump to conclusions on policy implications
