
Proceedings of the Third International SweMaS Conference

Umeå, October 14-15, 2003

Torulf Palm (Ed)

EM No 48, 2005

ISSN 1103-2685


Table of Contents

Preface ..................................................................................... v
Programme for the SweMaS Conference 14-15/10 2003 ................. ix

Developing Attractive Mathematics Curricula and Assessment for Post-16 Students in the UK
Geoff Wake ................................................................................ 1

Rasch Modelling in Test Development, Evaluation, Equating and Item Banking
Julie Ryan & Julian Williams ...................................................... 11

Encouraging Excellence through Assessment: Mathematics Examinations in Israel
Miriam Amit ............................................................................. 33

Trends of Assessment in German Mathematics Education
Gabriele Kaiser ......................................................................... 51

National Testing On Line: How Far Can We Go?
Steven Bakker & Gerben van Lent ............................................... 61

The Swedish National Course Tests in Mathematics
Jan-Olof Lindström ................................................................... 67

Test Quality Considerations
Torulf Palm ............................................................................ 107

The Test Bank in Biology: Assessment of Experiments and Field Investigations
Gunnel Grelsson & Christina Ottander ....................................... 111

Implementing a Creative Competence
Jesper Boesen ........................................................................ 121

Differential Item Functioning for Items in the Swedish National Test in Mathematics, Course B
Gunilla Näsström .................................................................... 131



Preface

Torulf Palm

Introduction

The papers in this volume reflect the issues that were discussed at the Third International Swedish Mathematics and Science Tests conference. The conference was the third in a recurrent series of conferences devoted to trends in assessment, particularly in large-scale mathematics assessment, and to the research and developmental work around the Swedish national tests in mathematics for upper secondary level and the national test banks in biology, chemistry, physics and mathematics produced by the Department of Educational Measurement, Umeå University, Sweden. The conference was organised by the Department of Educational Measurement and held in Umeå.

The participants of the conference were members of the department and of the International Scientific Advisory Board for SweMaS that is tied to the department. The board was set up in 2000, and its task is to review, evaluate and advise on the developmental work and research in the national test projects.

The conference was organised around three themes in line with the purposes of the conference. One theme was international trends in assessment, and in particular in large-scale mathematics assessment for upper secondary school level. These sessions were devoted to a trend that is shared by several different countries or regions in the world, or an example from a specific country or region. The contributions were made by the different members of the scientific advisory board. A second theme was the development of the tests. The idea is that the tests should be the object of an ongoing process of development. Issues identified as particularly important for the development of the tests were discussed from an international perspective and from the perspectives of the participants' different research specialities. Such discussions could benefit both the development of the national tests and the conference participants. The third theme was the discussion of research around the tests, which this time comprised the work of PhD students tied to the Department. The SweMaS conferences provide good opportunities for thoughts and research ideas to be developed.


Trends in assessment

The first talk devoted to the theme International trends in assessment was held by Julian Williams of the University of Manchester, England. He presented two pieces of work. The first was essentially a résumé of the work done in recent years on mathematical modelling in post-compulsory, pre-university mathematics courses in the UK. He recalled their development of modelling materials, practical work and assessment tasks for A-level Mathematics. He further drew attention to the more recent work on Free Standing Mathematics Units (mainly taken by students who are not prepared or not motivated to study A-level Mathematics) and the new AS level 'Use of Mathematics', which was piloted in 2003 and is now in widespread and growing use. The paper by Geoff Wake in this volume expands on this and provides further references to the recent work with 'Use of Mathematics'.

Second, he discussed the problems involved in test equating and item banking, mainly from a statistical (and Rasch model) point of view, though the issues are never only (or even essentially) statistical. The paper by Julie Ryan & Julian Williams explores these issues in more depth, and refers in particular to an earlier publication by Williams & Ryan (2000) where further background is available in relation to diagnostic and formative versus summative aspects of National Assessment.

Miriam Amit, Ben Gurion University of the Negev, Israel, describes the reform, and the underlying reasons for the reform, of the Israeli National Completion Examinations in Mathematics. Since 1997 the examination has gone through two stages of reform, with the main purpose of increasing the number of students taking the examinations, including the high-level mathematics examination, without lowering the mathematical requirements. The solutions included building a more flexible curriculum and assessment system, and partly linking them more strongly to everyday experiences.

Gabriele Kaiser, University of Hamburg, Germany, describes the changes in curriculum and large-scale assessments in Germany. As in the UK and Israel, there is a drive to engage more students in mathematics, and the reform work shares several components with the reforms in these other two countries, such as a modular structure of content and a stronger link between mathematics and the real world outside mathematics.

The contribution by Steven Bakker and Gerben van Lent, ETS Europe, The Netherlands, describes the development of IT applications in large-scale testing and summarizes the talk given by Steven Bakker. They describe the present state of the art, giving examples from different test production centres. The paper includes a discussion of problems as well as solutions, and an idea of what may come in the not too distant future.

Developmental work

For this conference three issues were chosen as focus for the discussion of the developmental work of the national tests and national test banks. These issues were test score comparability, test quality, and the production and assessment of items for the test bank in biology.

Jan-Olof Lindström, Umeå University, introduced the discussion about test score comparability for the different tests. He describes the specific context in which the Swedish national tests and test banks are situated. The system in which they function is quite different from that of most other countries, and the characteristics of the system most relevant for the discussion of test score comparability of the Swedish national tests are included in the contribution.

The second issue discussed under this theme was test quality. Different criteria for the quality of a test may be used (and are being used around the world) for monitoring and developing assessments. In addition, different emphasis may be put on the chosen criteria depending on the interpretation of the main purposes of the test, which for example can be based on an interpretation of the assignment given to the test developers. Torulf Palm, Umeå University, introduced this discussion, describing the choice of quality criteria focused on in the work on the national tests and test banks produced by the Department, and the way the validation of these tests is carried out. The written contribution outlines some of this work.

The third issue under this theme was the production and assessment of items for the test bank in biology. Gunnel Grelsson and Christina Ottander, Umeå University, describe the context in which this test bank is functioning and the considerations currently taking place with the aim of providing teachers with high-quality tests. In particular, they discuss the assessment of the students' scientific way of working, which is an important goal in the biology courses.


Research linked to the national tests

Jesper Boesen discusses research ideas concerning the implementation of mathematical problem solving in Swedish schools. There are indications of important discrepancies between the intentions described by the curriculum and syllabi and the actual outcome in student performance concerning this competence. He discusses possible research approaches, including the influence of the national tests.

Gunilla Näsström discusses a study of the fairness of one of the national tests, fairness being one of the criteria for test quality the Department emphasises in monitoring the quality of the tests. The study relies in significant part on statistical methods but also includes a more qualitative approach to one of the items. The discussion includes the results of the study as well as methodological issues.


Programme for the SweMaS Conference 14-15/10 2003

International members of the board:

Miriam Amit (MA), Steven Bakker (SB), Gabriele Kaiser (GK), Julian Williams (JW)

Participants from DEM:

Widar Henriksson (WH), Jan-Olof Lindström (JOL), Torulf Palm (TP), Jesper Boesen (JB), Peter Nyström (PN), Gunilla Näsström (GN), Gunnel Grelsson (GG), Ewa Bergqvist, Ingela Eriksson, Timo Hellström, Carl-Magnus Häggström, Kjell Lundgren, Gunnar Wästle

Programme

Tuesday

08.30-09.30 Introduction (WH, JOL, TP)

09.30-10.00 Coffee/Tea

10.00-12.00 Trends in assessment (JW, MA)

12.00-13.00 Lunch: Restaurant at Folkets hus

13.00-14.00 Trends in assessment (SB)

14.15-15.00 Paper discussion (JB)

15.00-15.30 Coffee/Tea

15.30-17.00 Test score comparability: Introduction and discussion (JOL)

19.15- Dinner: Restaurant Kåtan Bergsgården

Wednesday

08.30-09.30 Trends in assessment (GK)

09.30-10.15 Paper discussion (PN)

10.15-10.45 Coffee/Tea

10.45-11.30 Paper discussion (GN)

11.30-12.00 Test quality: Introduction and discussion (TP)

12.00-13.00 Lunch: Restaurant at Folkets hus

13.00-14.00 Test quality: Introduction and discussion continued (TP)

14.00-15.00 The test bank in biology (GG)

15.00-15.20 Coffee/Tea with sandwich

15.20- Closing of the conference (Topics for the next meeting)


Developing Attractive Mathematics Curricula and Assessment for Post-16 Students in the UK

Geoff Wake, University of Manchester

Introduction

Study of mathematics for post-16 students is perhaps inevitably enmeshed with assessment and qualification. When study of mathematics is not compulsory, as is currently the case in the UK, the courses and qualifications offered have to be designed to prove attractive and motivating to students. Here we briefly outline some of the main features of qualifications that have recently been designed and implemented with just such intent.

Background and context

The UK differs from other European countries in allowing its post-compulsory (that is, after age 16) school/college population to give up the study of mathematics. This is a problem that has been recognised in a number of recent major national reports. A national Skills Task Force (1999) proposed that there should be changes to the education and training system to ensure "a broad curriculum at upper secondary level which promotes the wider study of mathematics" (p. 43). Most recently, in a fully comprehensive inquiry into the teaching of mathematics post-14, Smith (2004) draws attention to possible factors underlying the recent decline in the number of students undertaking mathematical study post-16:

• the perceived poor quality of the teaching and learning experience;

• the perceived relative difficulty of the subject;

• the failure of the curriculum to excite interest and provide appropriate motivation;

• the lack of awareness of the importance of mathematical skills for future career options and advancement.

It is possible that in coming years we may develop our qualification structure so that the current relatively low level of participation in mathematics education is ameliorated: a Working Group is currently looking at 14-19 curriculum and qualifications reform in England (Tomlinson, 2004), and initial indications suggest that study of mathematics by post-16 students may eventually become compulsory. However, any reforms resulting from the Working Group's proposals are likely to take considerable time to develop and implement: the likely time-frame is ten years or more.

In the meantime, a recent innovation in the provision of mathematics qualifications for post-16 students has been to develop courses that may prove attractive to the type of students who, in the past, have found the standard provision of academic courses unattractive and have therefore given up their study of mathematics after sitting the terminal General Certificate of Secondary Education (GCSE) examination at age sixteen. Free Standing Mathematics Qualifications (FSMQs) have been designed to be equally useful to students taking either academic or vocational courses or, in some circumstances, courses combining a mix of both academic and vocational elements. A range of units was designed at each of three levels corresponding to the Foundation, Intermediate and Advanced Levels of the UK's National Qualifications Framework (see Figure 1).

Figure 1: Free Standing Mathematics Qualifications available at three levels of the UK's National Qualifications Framework.

Each unit forms a qualification in its own right and, as can be surmised from their titles, each focuses on an area of mathematics in some depth rather than being a more general, wide-ranging mathematics qualification. At Advanced Level the three FSMQs are recognised individually and may contribute to a student's achievement profile when applying for entry to university. However, recognising the importance attached to existing qualifications such as AS (Advanced Subsidiary, usually a one-year course of study in a particular subject) and A Level (which involves a second year of study beyond AS), a terminal unit, Applying mathematics, was developed; when this is taken in conjunction with the unit Working with algebraic and graphical techniques and one of the optional units (Modelling with calculus or Using & applying statistics), an overall Advanced Subsidiary award, AS Use of Mathematics, can be achieved (see Figure 2 below).

Figure 2: The modular structure of AS Use of Mathematics.

Curriculum design

The principles underlying the design of the new qualifications, and crucially their assessment, are common across all three levels and were based on an analysis of curriculum needs for the target population. The students likely to be involved will not wish to study mathematics for its own sake; students who do will most likely have been successful at GCSE and have opted to study AS followed by A Level Mathematics (the "academic" route). The new qualifications therefore recognise that if students from the target group are to elect to study mathematics they must see an immediate value in its use within their experience of other study, work or interests. Thus, the qualifications promote application of mathematics and mathematical modelling in situations that are likely to be meaningful to individual students, whether their study programme is academic, vocational or mixed. The design of the qualifications' specifications therefore not only addresses issues surrounding assessment but also attempts to define an appropriate, motivating curriculum and promote best practice in teaching.

The desire that students follow a mathematics course that has at its heart the application of mathematics raises difficult issues surrounding 'transfer' of knowledge. There has been much debate about the difficulties associated with 'transfer' in the mathematics education community; it appears paradoxical, as Molyneux-Hodgson (1999) points out, that a subject that prides itself on developing general results with wide application meets so many difficulties when students attempt to apply mathematics to solve problems in a variety of contexts.

When working on curriculum specification for the new qualifications our group in Manchester developed the construct 'general mathematical competence' (for further discussion see, for example, Williams, Wake & Jervis, 1999, or Wake and Williams, 2001). In summary, a general mathematical competence is the ability to perform a synthesis of general mathematical skills across a range of situations, bringing together a coherent body of mathematical knowledge, skills and models. Crucially, the construct of general mathematical competence moves us away from thinking of the mathematics curriculum as a collection of atomised mathematical skills towards consideration of 'big ideas': common ways in which we bring together and use mathematics to analyse situations and solve problems. The general mathematical competence implicitly (if not explicitly) recognises the value of teaching about and for transfer as advocated by Anderson, Reder and Simon (1997) and Evans (1999), amongst others.

Of course the assessment of a national qualification has to address issues of reliability and validity and has to operate within existing norms of national assessment practices. Such considerations, alongside the desire that the qualification should encourage students to solve substantial problems in situations that are meaningful to them, led to a desire for the inclusion of a substantial element of portfolio assessment alongside the necessary written examination. Each of the portfolio and written examination assessment components is designed to promote student use of general mathematical competences in focused areas of mathematics, and thus to promote authentic assessment. Although there is debate about the exact meaning of the term 'authentic' (for a flavour of the debate see for example Terwilliger (1997) and Newman, Brandt and Wiggins (1998)) we take here the definition offered by Lesh and Lamon (1992), "actual work samples taken from a representative collection of activities that are meaningful and important in their own right." The qualifications require students to develop a portfolio of their mathematics that demonstrates their learning over the duration of the course. This we believe fully satisfies the spirit of authentic assessment as defined above. Teachers have been encouraged to ensure that the portfolio assessment is as authentic to each individual student as possible by allowing students to work within contexts and with data generated elsewhere in their study of other subjects. Each portfolio is assessed under three themes:

Page 15: Proceedings of the Third International SweMaS Conference ... · Proceedings of the Third International SweMaS Conference Umeå, ... to the education and training system to ensure

5

• Structuring and presenting the work

• Using appropriate mathematics and working accurately

• Interpreting mathematics.

Figure 3 gives an example of part of the portfolio requirements for just one of the qualifications. This demonstrates how the assessment specification allows teachers to develop a course that focuses on mathematical modelling, with students applying mathematics in a range of different contexts with an emphasis on the use of graphic calculators and computer technology.

Figure 3: Part of the portfolio demands of the Advanced Free Standing Mathematics Qualification, Working with Algebraic & Graphical Techniques.

The written examinations have been developed to reflect the type of work that students will have undertaken in producing their portfolios, and stress understanding and development of mathematical models. They attempt to promote authenticity by engaging students with 'real' data (released up to two weeks in advance of the examination) and substantial contexts that require students to work with the mathematical content and processes they use to produce their portfolio. In one paper at Advanced Level the data released to students comprises a mathematical article on which the examination questions are then based; this allows assessment of "mathematical comprehension" (see Figure 4).


Figure 4: Part of pre-released mathematical article and sample examination question for terminal AS unit Applying Mathematics (Source: AQA Awarding Body, 2003).


Supporting curriculum implementation and assessment

A major issue for innovative curricula, such as those intended by the specifications of FSMQs, is ensuring that teachers are supported in implementing the curriculum using appropriate pedagogy in their classrooms and in preparing their students successfully for assessment. In this case immediate support for teachers was made available via a dedicated website hosted by the Nuffield Curriculum Centre (www.fsmq.org). A range of resources that can be downloaded from this website has been developed by our team based at the University of Manchester together with teachers from participating schools and colleges. For an example of the type of data made available see Figure 5 below. Teachers have been encouraged to work with such data in a way that allows learners to develop the mathematics necessary to complete the requirements of the portfolio assessment.

Figure 5: Data made available for teachers and learners at the curriculum support website www.fsmq.org.


Further to this we have written three texts (Haighton, Haworth and Wake 2003a, 2003b, 2004) to support the Advanced FSMQs and developed dynamic spreadsheets that allow teachers to use technology to develop important mathematical concepts. Figure 6 shows a screenshot of one such spreadsheet that teachers can use to demonstrate, or students to investigate, gradient functions (or derived functions) in the long tradition of David Tall's "graphic calculus". The screenshot attempts to illustrate how the spreadsheet can be animated to show a tangent being drawn at every point of the original function (in this case y = 20 + 75e^(-0.1x)) and the graph of its gradient function plotted.

Figure 6: Spreadsheet programmed to allow dynamic exploration of gradient functions.
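The idea the spreadsheet animates can be sketched in a few lines of code. The following is a minimal Python sketch (not the spreadsheet itself, whose internals are not given here): it estimates the gradient function of y = 20 + 75e^(-0.1x) by computing the slope of a numerical tangent at each point, which can be checked against the analytic derivative dy/dx = -7.5e^(-0.1x).

```python
import math

def f(x):
    # The example function from the spreadsheet: y = 20 + 75e^(-0.1x)
    return 20 + 75 * math.exp(-0.1 * x)

def gradient(func, x, h=1e-6):
    # Central-difference estimate of the tangent slope at x, analogous
    # to the tangent the spreadsheet draws at each point of the curve.
    return (func(x + h) - func(x - h)) / (2 * h)

# The analytic gradient function is dy/dx = -7.5e^(-0.1x); the numeric
# estimate should agree closely with it at every x.
for x in [0, 5, 10, 20]:
    numeric = gradient(f, x)
    analytic = -7.5 * math.exp(-0.1 * x)
    print(f"x={x:2d}  numeric={numeric:.4f}  analytic={analytic:.4f}")
```

Plotting the numeric estimates against x traces out the gradient-function graph that the spreadsheet builds up point by point.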

Conclusion and discussion

Uptake of these new qualifications is relatively low, but increasing substantially each year, with approximately 3000 students completing courses in 2002-3. There are many factors that militate against the quick acceptance of new qualifications in what is, in effect, a qualifications market place in the UK. For example, it takes considerable time for employers and university admissions tutors to become familiar with the qualifications and whether or not they serve their purposes as an appropriate preparation for Higher Education study or work.

Perhaps more encouraging is the feedback from students and teachers who use the qualifications and report very positively about their experience of the emerging implemented curriculum. In particular, initial indications are that students are becoming more confident about their ability to apply mathematics to solve problems in context, and they report a considerable increase in motivation due to the emphasis placed on their use of information technology in their learning and as a tool to solve problems.

The lessons we are learning are most important as we begin to consider what a compulsory mathematics curriculum might look like for post-16 students. It would be a mistake to think that compulsion will solve our problems: whatever mathematics courses we put in place need to ensure that students come to appreciate the power of mathematics in allowing them to make sense of the increasingly technological world of work and the immense amount of quantitative data they will meet every day as they go about their business as citizens.

References

Anderson, J.R., Reder, L.M., & Simon, H.A. (1997). Situative versus cognitive perspectives: Form versus substance. Educational Researcher, 26(1), 18-21.

Evans, J. (1999). Building bridges: Reflections on the problem of transfer of learning in mathematics. Educational Studies in Mathematics, 39, 23-44.

Haighton, J., Haworth, A., & Wake, G.D. (2003a). AS Use of Maths: Algebra & Graphs. Cheltenham: Nelson Thornes.

Haighton, J., Haworth, A., & Wake, G.D. (2003b). AS Use of Maths: Statistics. Cheltenham: Nelson Thornes.

Haighton, J., Haworth, A., & Wake, G.D. (2004). AS Use of Maths: Calculus. Cheltenham: Nelson Thornes.

Lesh, R., & Lamon, S. (Eds.). (1992). Assessment of authentic performance in school mathematics. Washington, DC: Association for the Advancement of Science Press.

Molyneux-Hodgson, S. (1999). Mathematical experiences and mathematical practices in the HE science setting. End of award report no. R000222571.

Newman, F., Brandt, R., & Wiggins, G. (1998). An exchange of views on "Semantics, psychometrics and assessment reform: A close look at 'authentic' assessments". Educational Researcher, 27(6).

Skills Task Force (1999). Second Report of the National Skills Task Force: Delivering Skills for All. London: Department for Education and Employment (DfEE).

Smith, A. (2004). Making Mathematics Count. London: The Stationery Office.

Terwilliger, J. (1997). Semantics, psychometrics and assessment reform: A close look at "authentic" assessments. Educational Researcher, 26(8).

Tomlinson, M. (2004). 14-19 Curriculum and Qualifications Reform: Interim Report of the Working Group on 14-19 Reform. London: DfES.

Wake, G.D., & Williams, J.S. (2001). Using College Mathematics in Understanding Workplace Practice. Summative report of a research project funded by the Leverhulme Trust. Manchester: University of Manchester.

Williams, J.S., Wake, G.D., & Jervis, A. (1999). General mathematical competence in vocational education. In C. Hoyles, C. Morgan & G. Woodhouse (Eds.), Mathematics Education for the 21st Century. London: Falmer Press.


Rasch Modelling in Test Development, Evaluation, Equating and Item Banking

Julie Ryan, Liverpool John Moores University
Julian Williams, University of Manchester

Abstract

Our project is attempting to develop a bank of items described by diagnostic content, curriculum area, and 'level': this involves placing all test items in the bank on a common 'difficulty' scale. We argue that the Rasch model is a simple and appropriate model for constructing such a scale. Using data from recent pretests and national tests in Mathematics, we have linked 'parallel' tests, different tiers, primary and secondary school stages and years using the Rasch family of models. We present some examples of our findings to date. The validity of our results relies on the validity of the Rasch model in the national testing context: we conclude that Rasch provides a stringent modelling tool which can be useful in equating when the data fit the model, and which can provide information about test inequity or differential item functioning when significant misfit is identified.
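The Rasch model that underpins this common difficulty scale can be sketched as follows. This is a minimal Python illustration of the one-parameter logistic model, not the authors' estimation software; the item difficulties used are invented for illustration.

```python
import math

def rasch_probability(theta, b):
    # Rasch (one-parameter logistic) model: probability that a person of
    # ability theta answers an item of difficulty b correctly. Both
    # parameters live on the same logit scale, which is what lets items
    # from different tests be placed on one common 'difficulty' scale.
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Two items with illustrative (invented) difficulties on the common scale:
easy_item, hard_item = -1.0, 1.5

# When ability equals difficulty the model gives exactly a 50% chance of
# success; this anchors persons and items to the same scale.
print(rasch_probability(-1.0, easy_item))  # 0.5
print(rasch_probability(0.0, easy_item))   # about 0.73
print(rasch_probability(0.0, hard_item))   # about 0.18
```

Because only the difference theta - b enters the model, calibrating overlapping (anchor) items across two tests is enough to express all items' difficulties on one shared scale, which is the basis of the equating and item-banking work described in the paper.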

Background to the project

This section of the introduction describes our motivation in wanting to create an item bank, and hence our current interest in the problems of scaling, dimensionality and test equating. There are, of course, other motivations for taking an interest in equating, such as test validity, equity and reliability of scoring. These issues will also be discussed tangentially in the paper.

In Williams and Ryan (2000) we described the work of an error analysis project for the Qualifications and Curriculum Authority (QCA) in 1997 which led us to our work in test development. Having identified and diagnostically analysed the errors children made in national mathematics tests, we decided to try to construct tests and item banks from items which are diagnostically rich. The aim was:

a) to locate common and pedagogically important errors children make in national curriculum tests by curriculum area, level and age/phase,


b) to disseminate this information to teachers, e.g. through the test mark schemes, and

c) to inform teachers’ mental models of their children and help improve classroom practice.

We were particularly motivated by the work done in the 1980s by the diagnostic teaching project here in the UK (Bell et al., 1985) and by the reports of Japanese classroom teaching described in the TIMSS video study (Schmidt et al., 1996); both of these suggest that a crucial aspect of improving mathematics teaching would be the dissemination of pedagogical knowledge about how children respond to problems and the mistakes they tend to make. We argue that teachers need to know pupils’ likely errors in advance of a lesson, when planning, if they are going to teach ‘diagnostically’.

We were also impressed by evidence that diagnostic assessment, when linked to performance on standardised tests, can have a measurable effect on improving performance of children. Sebatane (1998) claims that “the provision of diagnostic information based on the performance on standardised tests of pupils in primary schools, compared to the provision of norm-referenced information only, has been found to improve pupils’ achievement as measured by standardised tests” (Sebatane, 1998, p.124, citing Kellaghan, Madaus & Airasian, 1982).

Finally, our aim to generate an item bank which would organise diagnostic errors by curriculum field and level resonates with the work of Dassa et al. (1993, cited in Black & Wiliam, 1998), who organised such a bank “in a three-dimensional scheme: diagnostic context, notional content and cognitive ability, the items being derived from a study of common errors so that they could provide a basis of causal diagnosis. The overall purpose was to help teachers provide formative personal feedback within the constraints of normal classrooms. Trials in five classrooms showed that, by comparison with a control set of five more classrooms, those using the item bank had superior gains, the mean effect size being 0.7” (Black & Wiliam, 1998, p.46). We noted that such a gain in effect size is impressive, being of the order of magnitude of the differences between the UK and Pacific rim countries in international comparisons, and although the gain was identified in a small number of classrooms the procedure employed had the ‘ring’ of ecological validity. Why do we say this? We believe that the main reason that ‘diagnostic teaching’, along with much other research knowledge about teaching, has not become common knowledge (despite the evidence that it can ‘work’ in terms of pupil achievement) is that it has not been organised in the way teachers need it. The everyday hurly-burly of a classroom teacher’s life does not naturally support reading and reflection on teaching: the knowledge about children’s errors and thinking needs to be provided in a form which teachers can use, and this means mapping it on to their curriculum plan: by age, content domain, and level of ‘ability’.

Figure 1: Performance descriptions and errors by level, 1997

(Reprinted from Williams & Ryan, 2000)


Thus in our 1997 study (Fox et al., 1997; Doig et al., 1998) we plotted the errors made by children against the curriculum topic and ‘level’ awarded in the test. Figure 1 shows an example of the performance of seven-year-old children. The figures for the older children are correspondingly richer and more complex, but this one shows the kind of simple picture we are aiming to develop for teachers.

This paper reports on our recent and current work in linking our items with items in national tests into an item bank, so that each item is situated in the three-dimensional framework of ‘diagnostic context’, ‘content’ (i.e. national curriculum attainment target and programme of study) and ‘ability’ (i.e. level of achievement). Test items are initially written to a curriculum statement (in the attainment targets or programme of study), and this leaves us with the need to locate items on a common ‘difficulty’, ‘ability’ or achievement scale. This requires that we equate pre-tests, and that is the subject of this paper.

Measurement, item-banking and equating with the Rasch model

The Rasch model is one of a number of item response models which are now widely used for a variety of purposes, including item banking. The practical advantage of item-response models over Classical Test Theory (CTT) models lies precisely in their capacity to deal with missing information, and hence to link sets of items or tests through common examinees, or sets of examinees through common test items (Wright, 1977; Hambleton et al., 1991; Wright & Bell, 1984; Weiss & Yoes, 1991). An example will illustrate how this works.

The Rasch model seeks to assign a single number to each test item or mark, which is usually called the ‘difficulty’ of the item, and a single number to each examinee, which is usually called the ‘ability’ of the examinee. In the case of the mathematics tests the difficulty will be a measure of the difficulty of the mathematical item, and the ‘ability’ is intended to be a measure of the achievement of the child in mathematics. The Rasch model assigns a probability P that a child of ability ‘a’ will correctly answer an item of difficulty ‘d’:

P = exp(a – d) / (1 + exp(a – d)).

The probability increases rather like a cumulative normal function from 0 to 1 as the ability parameter increases from “minus infinity” to “plus infinity”, and is 50% when a = d.


Thus:

a – d = loge( P/(1 – P)).

This assigns a log-odds value to ability and difficulty parameters, known as ‘logits’. This has many technical and mathematical advantages over alternatives.
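The model and its log-odds form can be sketched in a few lines of Python (an illustration written for this paper, not the estimation software the project actually used):

```python
import math

def p_correct(ability: float, difficulty: float) -> float:
    """Rasch probability that a person of the given ability
    answers an item of the given difficulty correctly."""
    return math.exp(ability - difficulty) / (1 + math.exp(ability - difficulty))

def logit(p: float) -> float:
    """Log-odds of p: recovers the difference (ability - difficulty)."""
    return math.log(p / (1 - p))

# When ability equals difficulty the probability is exactly 50%.
print(p_correct(1.0, 1.0))                    # 0.5
# The log-odds of the probability recovers a - d (here 1.5 - 0.5 = 1 logit).
print(round(logit(p_correct(1.5, 0.5)), 6))   # 1.0
```

Note that the probability depends only on the difference a – d, which is exactly the parameter-separation property discussed below.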

Notice that the parameters ‘a’ and ‘d’ separate the characteristic of the examinee from the characteristic parameter of the item: this model suggests that the difficulty of an item rests as it were “within the item” and does not vary over different groups of children. The fact that different children will have different chances of correctly answering an item resides purely in the characteristic ‘ability’ measured by the parameter ‘a’. Thus the model insists that the relative difficulties of any particular pair of items should stay the same whatever the group of children being tested, and the relative abilities of any particular pair of children should stay the same whatever the group of items being used to test them.

Luce and Tukey established the axioms of fundamental measurement which underlie Rasch models (Luce & Tukey, 1964; Rasch, 1966; Wright & Stone, 1982), and these are believed by Rasch modellers to be the essential requirements for scientific measurement of intervals, or ‘objective measurement’: parameter separability and conjoint additivity. Thus the measuring instrument, like a ruler, must have constant intervals between marks on the scale whatever is being measured. Suppose the ruler shrank during the course of an experiment to see if a plant grows under the influence of a certain fertiliser. Now imagine an educational experiment in which the test measurement scale is different after the experiment, because the children have ‘learnt’ to answer some of the test items!

The Rasch model assumes that the estimation of difficulties of items is ‘sample free’, and that the estimation of abilities of examinees is ‘bias free’. This is just the model we want, because in an item bank we want to use different sets of items with different groups of children and know that, within reason, the difficulties and abilities estimated will be invariant to the particular choice.

These assumptions of the Rasch model in scaling have of course been questioned: it is manifestly true that items do often show bias between children of apparently the same ‘ability’ but of different ages, sex, language, class and so on. However, the use of the Rasch model, based on these assumptions, proves to be valuable in identifying such ‘bias’.

Thus the selection of a model based on a set of assumptions reflects our intention to construct unbiased and sample-free tests, and our need to evaluate the fit of test data to these requirements. In this sense we can agree with Goldstein and Wood (1989) that item response theory can usefully be thought of as item-response modelling, and we tackle their criticisms of the Rasch and other item response models accordingly. They claimed that in educational contexts the uni-dimensional assumption is not valid, because parameters will be unstable between different sub-samples, over time and between schools and curriculum contexts.

The assumption of one-dimensionality and finite-dimensionality of measurement

The problem of psychometric ‘scaling’ has been a central concern of, if not synonymous with, educational measurement since its beginning (Angoff, 1971). If a collection of test items can be placed on a common scale, i.e. if one parameter is enough to describe each item, then the scale is ‘one dimensional’. The fundamental validity question asked of any measurement model is then: can an ‘ability’ be described by one dimension?

The classical approach to this question has been through factor analysis, and there has been considerable attention in the literature to the respective strengths and weaknesses of factor analysis and item response modelling as alternatives (Schumacker, 1996, devoted an entire issue of Structural Equation Modelling to this question). The interested reader is referred to several other studies comparing classical methods with item response theory methods, by Fan (1998) and Williams et al. (1998).

In a sense this is an empirical question: to what extent can one usefully fit the data to a one-dimensional model? We use the Rasch model to evaluate this: the badness of fit of some items may be interpreted as a measure of the degree to which an item does not measure ‘mathematical achievement’ as measured by the test. A typical example is an item which requires the recall of a technical term such as ‘perimeter’ or ‘multiple’: many ‘able’ children manage to answer such questions incorrectly, and this reveals a weakness in the one-dimensional modelling or, if you prefer, a weakness in the item as a measure of mathematical achievement!

Indeed, we may use the Rasch model (or in fact factor analysis, if we resort to CTT) to identify clusters of weakly fitting items: to the extent that a cluster can be interpreted as a construct, such as mental mathematical achievement, we may thereby build sub-constructs of mathematical ability, each with its own dimension. Such finite-dimensional modelling is now within the capacities of Rasch modelling software (e.g. ACER’s program ConQuest; Adams et al., 1998). The fundamental questioning of ‘uni-dimensionality’ of measurement is therefore in fact a questioning of ‘finite-dimensionality’: can we measure ‘achievement’ or ‘ability’ in mathematics in a finite number of dimensions?

We believe that the really significant criticisms of Rasch’s uni-dimensionality assumptions are philosophical and educational: they mostly apply equally well to any attempt to measure in the educational context. Fundamental (perhaps fundamentalist) criticisms of the Rasch model assumption of uni-dimensionality therefore raise questions about the desirability of measurement as such. The history of assessment and testing in the UK over the past ten years is enough to explain, perhaps even justify, teachers’ and researchers’ antagonism towards measurement (Dougherty, 1995; Gipps, 1994; Gipps & Murphy, 1994).

The assumption of measurability

Actually, the question of uni-dimensionality is not just an empirical question, but a question of validity of the measurement intention (see the debate between Goldstein and Masters in Goldstein, 1995, and Masters, 1995). The danger is that we will make important what we can measure, rather than measure what is important. Gipps (1994) argues cogently for the need to develop richer forms of assessment in the interests of teaching and learning, and against the testing and measurement culture which reflects the need for accountability, reliability and ‘objectivity’. As Messick (1989) has argued, the consequential validity of an assessment requires that all the consequences of the process be considered.

While we can accept these arguments, and we are in favour of developing performance and other forms of assessment, outside of testing when necessary, we can still see the desirability of building measures which have some generalisability and robustness over ‘time and space’. Surely generalisability over time and space means precisely that tests must be linked, sample free and unbiased against sub-populations. The current arguments in the profession both in the UK and the USA for some form of evidence-based practice seem strong (e.g. Glaser, 1997; Hammersley, 1997; Hargreaves, 1996). Teachers will need evidence to change their practice, preferably evidence which they can collect and ‘see’ for themselves. Indeed, how are teachers to protect themselves from half-baked nostrums put forward by the new school improvement experts if they cannot ask for the ‘evidence’ of their efficacy? The review of research into the underpinning policies of the National Numeracy project was a welcome departure in this regard, for it ‘comes clean’ about the lack of decisive evidence in the literature for major planks of this programme (Brown et al, 1998).


We conclude that effective measurement is an intention, and the assumptions underlying the Rasch model provide us with rather demanding criteria against which to validate our measures.

Rasch and the linking method: an example of parallel test equating through anchored written tests

We illustrate with an example. A group of secondary school children took written tests in the form of various tiers: each child was entered for a 4-6, 5-7 or 6-8 tier (with link items, so that the 4-6 test overlaps with two-thirds of its items in common with the 5-7 test, and so on) and took one of two mental tests: test A or test B. The children are awarded a level (3, 4, 5, 6, 7 or 8) based on their total score for their tier, irrespective of which questions they get right in the papers they take, according to certain grade boundaries which are, of course, different for each tier. Some questions arise:

1. Is a given level awarded at one tier as difficult to achieve as the same level at another tier?

2. Is a given mark on mental test A as difficult to achieve as the same mark on the parallel mental test B?

The problem is to equate all the items’ difficulties on the various tests and all the children’s abilities: i.e. the problem is to ‘bank’ all the items. A classical test theory (CTT) approach to this might be to use the common items to anchor the test results by equipercentile or raw-score regression equating (Angoff, 1971). This is easier said than done: Kiek (1998) has done such a study with data from the 1998 national mathematics test. He assumes the samples which took the two mental tests are equivalent and equates the two scores by equipercentiles, though the same assumption would allow regression, and this would probably minimise the errors encountered in such an equation. The alternative CTT technique is to equate via the anchoring written tests (Lamprianou, 1999), which takes account of the possible differences in ability of the two samples. But then which written test should one use to equate? There are three alternatives, and it appears each will give a different equation unless the written tests are themselves somehow vertically equated! Thus it seems that CTT techniques will do the job, but only under the same essential measurement assumptions which underlie the Rasch model.

The Rasch method can be approached in a number of ways: separate scales can be estimated and item difficulties examined, and the scales then linked using the common items, which should have essentially the same difficulty estimates. Alternatively, a single scale can be estimated, assuming that all the children took ‘one’ test but were ‘absent’ from parts of it and so have missing data.

Taking the latter course, all the data on children’s responses to all the items arising in any of the tests are assembled in one data block, with missing data everywhere a child has not taken an item. The computer program QUEST (Adams & Khoo, 1996) estimates the parameters for each child and item in order to minimise the sums of squares of the differences between the data and the model in a ‘maximum likelihood estimation’. Where data are missing there is no difference contributing to the sum. (This account is a little simplified, since in some data-sets the responses involve partial credit and a somewhat more elaborate ‘partial credit’ model is used, in which each mark point within the item is given its own threshold: see Masters, 1982.)

The fit of the data to the model can now be examined in a number of ways: the fit of each item (or child) to the model can be calculated by summing the squares of the differences between the model and the actual data for that item (or child); these can be weighted (infit) or unweighted (outfit) and standardised to give a t-value, which suggests when an item (or child) significantly misfits the model (see Smith, 1991, 1996; Smith et al., 1998). The possibilities for examining groups of misfitting items have already been mentioned (these may suggest the construction of new dimensions worthy of reporting). To this we can add the possibility of identifying groups of misfitting children. These may be groups identified a priori, such as by sex and language in the identification of group or item ‘bias’ (often called differential item functioning by test developers with an eye to political and legal considerations), or they may be identifiable only post hoc. Person misfit may be used to identify groups, classes or schools performing significantly well or poorly.
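The weighted and unweighted mean-square fit statistics for a dichotomous item can be sketched as follows. This is an illustrative computation on invented toy data, under the standard definitions (outfit as the mean squared standardised residual, infit as the information-weighted version), not the QUEST implementation:

```python
import math

def rasch_p(ability, difficulty):
    # Rasch probability of a correct answer.
    return 1 / (1 + math.exp(difficulty - ability))

def item_fit(responses, abilities, difficulty):
    """Infit and outfit mean-squares for one dichotomous item,
    given each child's 0/1 response and ability estimate."""
    z2 = []               # squared standardised residuals
    resid2_sum = 0.0      # sum of squared raw residuals
    var_sum = 0.0         # sum of model variances (the 'information')
    for x, a in zip(responses, abilities):
        p = rasch_p(a, difficulty)
        var = p * (1 - p)
        z2.append((x - p) ** 2 / var)
        resid2_sum += (x - p) ** 2
        var_sum += var
    outfit = sum(z2) / len(z2)   # unweighted: sensitive to outliers
    infit = resid2_sum / var_sum # weighted by model variance
    return infit, outfit

# Toy data: five children answering an item of difficulty 0.0.
infit, outfit = item_fit([1, 1, 0, 1, 0], [1.2, 0.4, -0.3, 0.8, -1.5], 0.0)
print(round(infit, 2), round(outfit, 2))
```

Values near 1 indicate fit consistent with the model; the standardisation to a t-value mentioned above is a further transformation of these mean-squares.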

The result is a set of parameters for abilities and difficulties, and these model the probability of each child’s response to each item: Table 1 shows the probability that a child of ability 0.5 and 1.5 respectively answers each question from test A and test B correctly. The marks expected at these abilities are therefore 19.11 and 23.73 for test A, and 19.19 and 23.75 for test B, indicating that test A is about 0.08 and 0.02 marks harder than test B respectively (no doubt the Minister will be worried!).


Table 1. Item difficulty (logit) and probability of a correct answer for each mark in mental tests A and B, for pupils of ability 0.50 and 1.50.

              TEST A                            TEST B
mark   logit   P(0.50)  P(1.50)    mark   logit   P(0.50)  P(1.50)
  1    -2.72    0.96     0.99        1    -2.83    0.97     0.99
  2    -1.00    0.82     0.92        2    -0.72    0.77     0.90
  3    -0.89    0.80     0.92        3    -1.10    0.83     0.93
  4    -0.46    0.72     0.88        4    -0.23    0.67     0.85
  5    -1.92    0.92     0.97        5    -1.75    0.90     0.96
  6    -2.36    0.95     0.98        6    -2.22    0.94     0.98
  7    -1.48    0.88     0.95        7    -1.53    0.88     0.95
  8    -1.09    0.83     0.93        8    -1.18    0.84     0.94
  9    -1.20    0.85     0.94        9    -1.04    0.82     0.93
 10    -0.74    0.78     0.90       10    -0.86    0.80     0.91
 11    -0.38    0.71     0.87       11    -0.20    0.67     0.85
 12    -0.38    0.71     0.87       12    -0.24    0.68     0.85
 13    -0.19    0.67     0.84       13    -0.29    0.69     0.86
 14     0.01    0.62     0.82       14    -0.07    0.64     0.83
 15     0.37    0.53     0.76       15     0.48    0.50     0.73
 16     0.62    0.47     0.71       16     0.82    0.42     0.66
 17     1.76    0.22     0.44       17     1.47    0.27     0.51
 18     1.80    0.21     0.43       18     1.88    0.20     0.41
 19    -0.30    0.69     0.86       19    -0.86    0.80     0.91
 20    -0.78    0.78     0.91       20    -1.01    0.82     0.92
 21    -0.68    0.76     0.90       21    -1.19    0.84     0.94
 22    -0.25    0.68     0.85       22    -0.10    0.65     0.83
 23    -0.70    0.77     0.90       23    -0.91    0.80     0.92
 24    -0.30    0.69     0.86       24    -0.45    0.72     0.88
 25     0.26    0.56     0.78       25     0.42    0.52     0.75
 26     0.69    0.45     0.69       26     0.68    0.46     0.69
 27     0.42    0.52     0.75       27     0.38    0.53     0.75
 28     1.86    0.20     0.41       28     1.69    0.23     0.45
 29     1.63    0.24     0.47       29     1.82    0.21     0.42
 30     2.48    0.12     0.27       30     2.62    0.11     0.25

Marks expected: 19.11 (ability 0.50) and 23.73 (ability 1.50) for Test A; 19.19 and 23.75 for Test B.


But more significantly, the set of logits for each mental question in test A and test B has been registered on one scale, i.e. they have potentially entered our item bank. The ‘only’ difficulty may be the question of model-fit: for instance one of the mental test items in test B has a misfit over 1.2, which according to custom and practice is regarded as substantial, and certainly warrants attention in the context of national high-stakes testing. Assuming the items in the bank fit the model, we can predict how children of different abilities will behave on different sets of items drawn from the bank.
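The prediction behind Table 1 is a straightforward sum of modelled probabilities over the chosen set of banked items. A sketch in Python (a hypothetical helper written for this paper), using the Test A logits from Table 1:

```python
import math

def expected_score(ability, difficulties):
    """Expected raw score for a given ability: the sum over items
    of the Rasch probability of a correct answer."""
    return sum(1 / (1 + math.exp(d - ability)) for d in difficulties)

# Item difficulties (logits) for mental test A, read from Table 1.
test_a = [-2.72, -1.00, -0.89, -0.46, -1.92, -2.36, -1.48, -1.09, -1.20,
          -0.74, -0.38, -0.38, -0.19, 0.01, 0.37, 0.62, 1.76, 1.80,
          -0.30, -0.78, -0.68, -0.25, -0.70, -0.30, 0.26, 0.69, 0.42,
          1.86, 1.63, 2.48]

print(round(expected_score(0.5, test_a), 2))   # approximately 19.11
print(round(expected_score(1.5, test_a), 2))   # approximately 23.73
```

Any subset of banked items could be passed in the same way, which is what makes the bank usable for assembling new tests of predictable difficulty.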

Vertical equation across tiers: an example of item-linking over a ‘common curriculum’

We illustrate the vertical equation of tiered tests in two recent years in Figures 2.1 and 2.2. Taken at face value, it seems that it has been somewhat easier, in the past, for children entering the higher tiers to achieve a given level. This could be justified in various ways: schools are thereby encouraged to enter children at a higher tier, or children at lower tiers should be expected to show a higher degree of mastery of the common (link, and hence easier) material on which their curriculum may have focussed.

However, an alternative explanation might be that the standard-setting exercise used (a variation of the Angoff method described in Morrison et al., 1994) is based on a different construction of difficulty and ability. Whatever the validity of this method, its reliability as a method in educational assessment has been called into question by Impara and Plake (1998), who found that teachers were inaccurate in their estimates of their children’s performance on items.

However, the validity of these results can also be questioned: if the items are scaled separately for the groups at each tier, can we assume the item estimates are stable for the various sub-groups? Such a question is essentially probing the validity of the Rasch model in this context. But it is equally probing the validity of vertically equating the tiers at all: have the children entered for different tiers really followed the same curriculum? If not, is it valid to assign a level 5 to children who have followed two different curricula? These questions focus attention even more sharply on our next example: linking primary and secondary tests.


Figure 2.1: Vertical equating of scores from different tiers of written tests in one recent year. [Charts not reproduced: raw-score scales for each tier aligned against a common logit scale from about -4 to 4.]


Figure 2.2: Vertical equating of scores from different tiers of a test in one recent year, using scales including the mental test. [Charts not reproduced: tier raw-score scales aligned against a common logit scale.]


The vertical equation of two Stages: item-linking of ‘different’ curricula

Drawing on the data from our own pre-tests, we construct scales for the primary (PRIM) and secondary (SEC) stage children separately, and consider the differential item functioning (DIF) of the common linked items between these two sub-populations: see Figures 3.1 and 3.2 below.

Figure 3.1: PRIM-SEC item link for the common items: y = 0.9053x - 1.4574, R² = 0.8649. [Scatter plot not reproduced; axes are item logits in the PRIM and SEC calibrations.]

Of the 400 common marks, we find a statistically significant difference between PRIM and SEC stage performance on approximately half. We notice that statistical significance is achieved by items whose logit values vary across PRIM and SEC by more than 0.2. This is about one-fifth of a ‘level’, and so a rather small difference educationally (i.e. corresponding to about 5 months of standardised development).

Examining the regression equations, we conclude that the logit-conversion equation should be approximately SEC = PRIM + 1.4 and PRIM = SEC - 1.4. This now allows us technically to combine the SEC and PRIM items into one bank. Note that the regression equation optimises the prediction of the difficulty of items in one sub-population from their difficulty in the other (the least squares method), but the equating requires a symmetric function, rather as the equipercentile equating of Angoff provides in CTT.

However there are some technical issues: should we exclude from the equation items which perform most significantly differently across the sub-populations? If we do this, we will be distorting the equation by excluding such items.

Figure 3.2: SEC-PRIM item link for the common items: y = 0.9554x + 1.3869, R² = 0.8649. [Scatter plot not reproduced; axes are item logits in the SEC and PRIM calibrations.]

Actually this is not just a technical issue: it hides a curriculum issue. The items which one would exclude from the equation are those which relatively favour one group over the other, i.e. the items with sub-population DIF. Placing all the items and all relevant children into a DIF estimation reveals the item DIF in just the same way as in language or sex DIF. We examine the items which differ significantly across sub-populations and find:

• items which favour the SEC stage include richly contextual and verbal or cultural contexts: this includes some of the interpretation of graphs, verbal problem solving and reading of scales and tables, indicating a cultural or maturational effect;

• items which involve algebra or probability tend to favour the SEC stage: indicating a curriculum effect; and


• items which favour the PRIM stage include ‘bare number’ items such as times-tables facts, place value and some straightforward applied arithmetic problems.

There are some differences which appear to be anomalies which we cannot explain. However, in this situation one must characterise groups of items rather than individual items: since the model is probabilistic, one expects to find (at the 0.05 level) about 5% of items ‘significantly’ misfitting by chance alone!

Longitudinal linking of one year to the next: examples of item-linking and person-linking

So far we have collected two kinds of data in attempts to link one year’s data with the next. By estimating the item difficulties year on year, we were able to regress sets of item estimates: in Figure 4 we illustrate the regression of the 13 link items (assessing levels 2 and 3). This provides a reasonable R-squared and suggests an equation of approximately:

Pretest logit = Previous year logit + 1.9

Such a formula allows us to estimate on our pretest the cut score equivalent to the previous year’s cut score, and to estimate the difficulty of each previous year item as compared to the pretest items.

Figure 4: The performance of common link items from the previous year used in a pretest (levels 2-3). Regression of pretest logits on previous-year logits: y = 1.0483x + 1.8898, R² = 0.7638. [Scatter plot not reproduced.]

Thus we could identify items whose performance fit well enough, and examine items which did not fit well. This methodology exactly parallels that of linked tiers and sub-populations (PRIM and SEC stages). Indeed we have fewer reservations about differences in curriculum from one year to the next than from one school stage (PRIM/SEC) to the next, although the change in the assessment of the curriculum to include the assessment of mental mathematics might be thought a major curriculum shift with backwash effects on longitudinal data. Apart from this, the drift in the curriculum has been gradual in recent years and one would not expect it to cause problems in comparisons of nearly adjacent years.

Second, we linked end-of-year test data to our pretest by collecting the pretest children’s end-of-year test data. Collecting the children’s raw score on each test, we are able to regress the children’s raw scores, or various transformations of these scores (e.g. logit = log(rawscore / (100 - rawscore))), for the pretest on the national test score. This provides a mapping from our scale to the following year’s scale (again estimated from the School Sampling Project database by Rasch scale estimation). Hence we have an estimation of every item difficulty on the same scale as every previous item difficulty. In principle, though we have not yet completed this in practice, we have a single bank of both years’ test items, and on this single scale we have the various grade boundaries represented as ability estimates.
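The raw-score transformation used in this person-linking step can be sketched as follows (a hypothetical helper written for this paper; note that zero and perfect scores have no finite logit and would need special handling in practice):

```python
import math

def score_to_logit(raw_score, max_score=100):
    """Convert a raw score out of max_score to a log-odds value,
    logit = log(raw / (max - raw)); undefined at 0 and at max."""
    if not 0 < raw_score < max_score:
        raise ValueError("zero and perfect scores have no finite logit")
    return math.log(raw_score / (max_score - raw_score))

print(round(score_to_logit(50), 3))   # 0.0 (half marks: even odds)
print(round(score_to_logit(75), 3))   # about 1.099 (log of 3-to-1 odds)
```

The transformed scores, rather than the raw scores themselves, are then regressed between the two test administrations.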

But examining the regression of abilities estimated for the children, we note several difficulties in this person-linking of scales:

1. The regression of person-estimates on two different tests shows a degree of unreliability in individual estimates (rather than means). In fact this unreliability is of similar order to the unreliability of different parts of the tests, and represents a challenge to the Rasch model. On the whole it seems that there is more standard error in person estimates (on a 100 mark test, say) than in item estimates (on a sample of 1000 children, say).

2. While one can examine items considered for exclusion as outliers, and consider the curriculum bias that such exclusion introduces into the anchorage, the same cannot be said of the excluded children!
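The asymmetry noted in point 1 follows from the information available to each estimate: under the Rasch model the standard error of an estimate is roughly one over the square root of the total Fisher information, and information accrues per response, so a person measured by 100 items is estimated less precisely than an item calibrated on 1000 children. A rough best-case sketch (taking every response probability at p = 0.5, which maximises the per-response information p(1 − p)):

```python
import math

def best_case_se(n_responses):
    """Lower bound on the Rasch standard error of an estimate based on
    n_responses, taking each response probability at p = 0.5 so the
    Fisher information per response is p * (1 - p) = 0.25."""
    return 1.0 / math.sqrt(0.25 * n_responses)

person_se = best_case_se(100)   # person measured by a 100-mark test
item_se = best_case_se(1000)    # item calibrated on 1000 children
# Even in this best case the person SE (0.2 logits) exceeds the
# item SE (about 0.063 logits), as the text observes.
```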

Conclusions and discussion

The Rasch method provides one model of measurement that is particularly useful in work on item banking and hence measurement of change over time, or over parallel or otherwise equated test forms. The validity of the equating and banking is always under threat, but we argue that the Rasch model provides a convenient, simple and parsimonious tool for identifying the size and shape of this threat. Thus, we were able to examine the parallel equation of the two mental tests and pick out the possible (though trivial, as it turns out) problem of item misfit threatening the validity of equation. We then looked at vertical equation and considered the equity of cut scores matched to ability estimates, but had to consider the possibility that curriculum effects threaten the validity of vertical scaling. The Rasch scaling provides a warning, but not a definitive judgement.

In equating across different school sub-populations PRIM/SEC (years), we are presented with differential item functioning (DIF) which suggests the need for thought about possible differential school stage scales, reflecting maturational as well as curriculum differences. Although the common items lead to possible equation of levels and item difficulties, there may be a danger in ignoring the change in programme of study which seems to have led to some shift in relative item performance. One solution, the exclusion of differential items from the equation, is worth considering. But one should not proceed with this lightly, for the same reason one should not thoughtlessly exclude items which misfit from a scale. The model has again thrown up food for thought: what is it about the stages of the curriculum and assessment that is different, how educationally significant are these results, and what is the scale of distortion involved in equating them?

Finally we considered the example of equating over time, by utilising linked common items or common persons. The former is definitely the preferable method for the reasons given, and the curriculum effect of change over time was thought to be, in normal times, not too burdensome. We intend to keep a close watch on the evolution of mental test performance in particular, however, and if hoped-for changes arising from the National Numeracy Strategy have an impact, we imagine the examination of arithmetic subscales will be productive.

References

Adams, R. J. & Khoo, S-T. (1996). Quest: the interactive test analysis system (Melbourne, Australian Council for Educational Research).

Adams, R. J., Wu, M. L. & Wilson, M. R. (1998). ConQuest: Generalised item response modelling software (Melbourne, Australian Council for Educational Research).

Angoff, W. H. (1971). Scales, norms and equivalent scores, in: R. L. Thorndike, (ed.) Educational measurement (Washington D.C., ACE).

Bell, A. W., Swan, M., Onslow, B., Pratt, K., & Purdy, D. (1985). Diagnostic teaching: teaching for lifelong learning (University of Nottingham, Shell Centre for Mathematical Education).


Black, P. & Wiliam, D. (1998). Assessment and classroom learning, Assessment in Education, 5 (1), pp. 7-74.

Brown, M., Askew, M., Baker, D., Denvir, H., & Millet, A. (1998). Is the national numeracy strategy research based? British Journal of Educational Studies, 46(4), pp. 362-385.

Christie, T. & Boyle, W. (1994). Mathematics cross key stage comparability. (University of Manchester, CFAS/SCAA).

Cooper, B. (1994). Authentic testing in mathematics? The boundary between everyday and mathematical knowledge in National Curriculum testing in English schools, Assessment in Education, 1(2), pp.143-166.

Dassa, C., Vazquez-Abad, J. & Ajar, D. (1993). Formative assessment in a classroom setting: from practice to computer innovations, The Alberta Journal of Educational Research, 39 (1), pp. 111-125.

Daugherty, R. (1995). National curriculum assessment, a review of policy 1987-1994 (London, Falmer).

Doig, B. A., Fox, P., Ryan, J. T. & Williams, J. S. (1997). Understanding children’s mathematics to raise standards: 1997 response analysis of key stage 1 Mathematics test (University of Manchester, Mechanics in Action Project).

Fan, X. (1998). Item response theory and classical test theory: an empirical comparison of their item/person statistics, Educational and Psychological Measurement, 58 (3), pp. 357-381.

Fox, P. P., Ryan, J. T. & Williams, J. S. (1997). Understanding children’s mathematics to raise standards: key stage 3 error analysis, 1997 (University of Manchester: Mechanics in Action Project).

Gipps C. V. (1994). Beyond testing: towards a theory of educational assessment, (London, Falmer).

Gipps, C. V. & Goldstein, H. (1983). Monitoring children: an evaluation of the Assessment of Performance Unit (London, Heinemann).

Gipps, C. V. & Murphy, P. (1994). A fair test? (Milton Keynes, Open University).

Glaser, R., Lieberman, A. & Anderson, R. (1997). ‘The vision thing’: Educational research and AERA in the 21st century part 3: perspectives on the research-practice relationship, Educational Researcher, 26 (7), pp. 24-25.

Goldstein, H. & Wood, R. (1989). Five decades of item response modelling, British Journal of Mathematical and Statistical Psychology, 42 (2), pp. 139-167.

Goldstein, H. (1995). Interpreting international standards of student achievement: Educational studies and documents, 63 (Paris, UNESCO).

Hambleton, R. K., Swaminathan, H. & Rogers, H. J. (1991). Fundamentals of item response theory (Newbury Park, CA, Sage).


Hammersley, M. (1997). Educational research and teaching: a response to David Hargreaves’ TTA lecture, British Educational Research Journal, 23 (2), pp. 141-161.

Hargreaves, D. H. (1996). Teaching as a research-based profession: possibilities and prospects, Teacher Training Agency annual lecture (London, Teacher Training Agency).

Harlen, W. & James, M. (1997). Assessment and learning: differences and relationships between formative and summative assessment, Assessment in Education, 4 (3), pp. 365-379.

Impara, J. C. and Plake, B. S. (1998). Teachers’ ability to estimate item difficulty: a test of the assumptions in the Angoff standard setting method, Journal of Educational Measurement, 35(1), pp. 69-81.

Kellaghan, T., Madaus, G. F., & Airasian, P. W. (1982). The effects of standardised testing (Boston, Kluwer-Nijhoff).

Lamprianou, I. (1999). The Comparative use of classical test theory and Rasch analysis in test equation. Unpublished M. Ed. thesis, University of Manchester.

Linn, R. L. (1989). (Ed.) Educational Measurement (3rd. Edition) (London: Collier Macmillan).

Luce, R. D. and Tukey, J. W. (1964). Simultaneous conjoint measurement: A new type of fundamental measurement. Journal of Mathematical Psychology, 1, 1-27.

Masters, G. N. (1988). Partial credit model, in: J. P. Keeves (Ed.) (1988) Educational research, methodology and measurement (Oxford, Pergamon)

Masters, G. N. (1995). Scaling and aggregation in IEA studies: critique of Professor Goldstein’s paper: Educational studies and documents, 63 (Paris, UNESCO).

Messick, S. (1989). Validity, in: R. L. Linn (Ed.) Educational Measurement (3rd. Edition), pp. 12 - 103. (London: Collier Macmillan).

Morrison, H. G., Busch, J. C. & D’Arcy, J. (1994). Setting reliable National Curriculum standards: a guide to the Angoff procedure, Assessment in Education, 1(2), pp. 181-199.

Rasch, G. (1966). An item analysis which takes individual differences into account. British Journal of Mathematical and Statistical Psychology, 19, pp. 49-57.

Ryan, J. T., Williams, J. S. & Doig, B. A. (1998). National tests: educating teachers about their children’s mathematical thinking, in: A. Olivier & K. Newstead (Eds.) Proceedings of the 22nd conference of the International Group for the Psychology of Mathematics Education, vol. 4, pp. 81-88. (PME-22) (South Africa, University of Stellenbosch).


Schmidt, W. H., Jorde, D., Cogan, L. S., Barrier, E., Gonzalo, I., Moser, U., Shimizu, K., Sawada, T., Valverde, G. A., McKnight, C., Prawat, R. S., Wiley, D. E., Raizen, S. A., Britton, E. D. & Wolfe, R. G. (1996). Characterising pedagogical flow: an investigation of mathematics and science teaching in six countries (Dordrecht, Kluwer).

Schumacker, R. E. (Ed.) (1996). Structural Equation Modelling, 3(1).

Sebatane, E. M. (1998). A response to Black and Wiliam, Assessment in Education, 5(1), pp. 123-130.

Smith, R.M. (1991). The distributional properties of Rasch item fit statistics. Educational and Psychological Measurement, 51, pp. 541-565.

Smith, R. M. (1996). A comparison of methods for determining dimensionality in Rasch measurement, Structural Equation Modelling, 3(1) pp. 25-40.

Smith, R. M., Schumacker, R. E. & Bush, M. J. (1998). Using item mean squares to evaluate fit to the Rasch model, Journal of Outcome Measurement, 2(1) pp. 66-78.

Weiss, D. J. & Yoes, M. E. (1991). Item response theory, in: R. K. Hambleton, & J. N. Zaal, (Eds.) Advances in educational and psychological testing (Boston, Kluwer).

Williams, J. S. & Ryan, J. T. (2000). National testing and the improvement of classroom teaching: can they coexist? British Educational Research Journal, 26(1), pp. 49-73.

Williams, V. S. L., Pommerich, M. and Thissen, D. (1998). A comparison of developmental scales based on Thurston methods and item response theory, Journal of Educational Measurement, 35(2), pp. 93-107.

Wright, B. D. & Bell, S. (1984). Item banks: what, why, how? Journal of Educational Measurement, 21(4), pp. 331-345.

Wright, B. D. & Linacre, M. (1991) BIGSTEPS: Rasch analysis for all two-facet models (Chicago, MESA Press).

Wright, B. D. & Masters, G. N. (1982) Rating scale analysis (Chicago, MESA).

Wright, B. D. & Stone, M. H. (1979) Best test design (Chicago, MESA).

Wright, B.D. (1977) Solving measurement problems with the Rasch model, Journal of Educational Measurement, 14, pp. 97-116.


Encouraging Excellence through Assessment: Mathematics Examinations in Israel Miriam Amit, Ben Gurion University of the Negev

Introduction - National Completion Examinations (NCE)

Reforms in mathematics education – in curricula and in reorganizing teaching methods – are a result of changes in the mathematical community’s concepts of “what is important and what is not” in mathematics, as well as changes within the society as a whole (Amit & Fried, 2002). However, reforms in the structure of the national completion examinations in mathematics (NCEM) are often due to political and social factors rather than to the subject matter.

The problem of final examinations in general is very dynamic, as explained by Friedman & Ben-Galim (1996):

Changes in topics and exam structures take place frequently in many countries. Some countries perform significant changes while other countries cancel changes that were previously made. The basis for change lies in the will to adjust these exams to new learning methods that themselves change frequently as a result of rapid social, economic and technological advancements. The unconcealed argument for changes in the NCE leans on the necessity to provide explanations for the growing student population, increasing numbers of school dropouts and a proper answer to the “global village” where there is no information screening. In many countries, there is an emphasis on the growing need for excellent education. Students are seen as the major component of “human wealth” and should, therefore, be treated as such.

NCE are usually designed to achieve one or more of the following goals: analyzing students’ knowledge upon completion of high school, admission to a university or other higher education facility, evaluation for employment purposes, and a means of indirectly influencing teaching methods and school curricula. In Israel, the place of the NCE is rather prominent, as stated by Yaakov Kopp, director of the Center for the Research of Social Policy in Israel:

The Bagrut examinations fulfill two functions in Israel: the one—as a means of assessing the scholastic achievements of high school graduates, and the other—as a key opening the door to the next stage of the educational system—higher education. The second function of the examinations is, in fact, not merely educational but clearly social as well, determining not to a small degree the socioeconomic future of the individual (in Levy, 1995, preface).

The Israeli NCE comprises a very wide range of subjects. The obligatory subjects are: mathematics, civics, English, Hebrew (or Arabic for the Arab students), history, literature and Bible (Koran for Moslem students). Failure to pass or to take any one of the above subjects prevents the student from obtaining a Bagrut (which means “maturity” in Hebrew) certificate, a certificate which is equivalent to a high school diploma and is a minimum prerequisite for continuing studies at a university level (Amit & Fried, 2002). In other words, obtaining a Bagrut certificate determines the student’s future career and status.

Since the NCE has such a crucial influence upon a student’s life, the country must provide an equal opportunity for all to succeed in it and assure their future education and employment, regardless of their educational and personal background.

The uniformity of the NCE ensures an objective approach. If, for example, a student’s admission to the university had been determined only by his school’s grades, universities would have treated peripheral schools differently than prestigious ones. Students from the periphery would have had a double disadvantage: being taught by inexperienced and perhaps second-rate teachers, and their grades being rated differently. It is well known that some countries rate their high schools and their grades are weighted accordingly. Had this system been in effect in Israel, it would have caused a fixation of the country's social structure. Uniformity in the NCE (content, structure, time, grading, scoring and informing) was created in order to eliminate educational disadvantages derived from socio-economic background, thus opening higher education to all; in other words, in order to democratize education for all and not only for the elite.

One way to ensure equal opportunity in obtaining the Bagrut is to reduce the level of requirements. Keeping in mind the impact of Bagrut on the educational system, reducing the demands of the NCE would automatically cause a drastic drop in the level of teaching, thus causing enormous damage. Thus the excellent and highly motivated students would suffer instead of being rewarded for their skills. Another way to help students reach the finishing line is by providing them with good learning techniques and an assessment system that will enable them to utilize their full potential at their own pace and on their terms. This is the right way to reach educational and mathematical democracy without losing excellence. This aim has led to a two-stage reform in NCE in Mathematics.


Before proceeding, here are a few figures and some information about the Israeli educational system.

Background on Israel:

• Population: 6.8 million

• Student population: 2.2 million

• Compulsory education between the ages of 5-15 (more than 90% study until the age of 18)

• Public education only (no private schools)

School structure and assessment of student achievement:

• Primary school – grades K-6, internal assessment

• Middle school – grades 7-9, internal assessment

• Secondary school – grades 10-12, internal + external NCE (Bagrut)

The NCE Bagrut tests are paper-and-pencil tests, taken by high school students towards the end of high school, usually in the 11th-12th grades. The Bagrut* certificate includes obligatory subjects as well as electives, as illustrated below:

Obligatory: Mathematics; Hebrew (Arabic); English; Civics; Literature; Bible (Koran)

Elective: Natural sciences; Art; Social sciences; Languages; a short research project

There are usually three levels of difficulty for each exam: ordinary, advanced, and highly advanced (for some subjects there are only two levels).

Eligibility for a Bagrut certificate requires a passing grade in:

• All the obligatory topics.

• At least one obligatory topic at the advanced level

• At least one elective topic


Mathematics was, and still is, the major obstacle to receiving a Bagrut certificate. The aim of the structural reform described herewith is to enable a greater number of students to overcome this obstacle, but not at the expense of excellence.

* The word bagrut means “maturity” in Hebrew. The NCE leading to the Bagrut certificate, the Bagrut examination, is the counterpart of the German Maturitätsprüfung.

The National Completion Examination in Mathematics - NCEM

The Israeli national curriculum in mathematics is composed of three levels: ordinary, advanced and highly advanced. These levels differ in two dimensions: the range of mathematical topics and the depth to which they are taught, and the teaching approach. For example, there are differences in the:

• Amount of symbolism versus verbalism,

• Variety of representations,

• Amount of intuitive reasoning versus rigorous mathematics, and

• Modes of justification and argumentation

(Based on Amit and Fried, 2002).

The topics of each level are incorporated in the higher one. The following core topics appear in all levels:

• Algebra (including algebraic equations, systems of equations, graphs, sequences and their applications);

• Geometry (2 and 3 dimensional) and analytic geometry;

• Trigonometry (including some geometric applications);

• Probability and statistics (basic concepts);

• Basic ideas of calculus (differential and integral).

Topics added to the advanced level:

• Vectors

• Spaces

• Mathematical induction


Topics added to the highly advanced level:

• Combinatorics

• Complex numbers

The two-dimensional structure is illustrated in Figure 1 (based on Amit and Fried, 2002).

The Pre-reform Structure of the NCEM

Until 1997 the NCEM was composed of three distinct questionnaires, one for each level, as illustrated below:

In the pre-reform structure, it was impossible to move from one level to another without retaking the whole exam. Moreover, most students were directed towards a certain level at the end of 9th grade. There was hardly any mobility in later years, based on a pedagogical approach called “the best placement for the student”. This approach determined, in many cases, the student’s future at the age of 14(!), since there was a linkage between the level of the exam and the student’s post-secondary and university studies:

• Ordinary level: minimum requirement for the Bagrut certificate.

• Advanced level: minimum requirement for admission to some social sciences (e.g. economics, psychology) and most of the natural sciences.

• Highly advanced level: requirement for computer sciences, physics, engineering, medicine.

Some data:

According to data supplied by the Ministry of Education, the distribution of examinees in the three levels showed that between 1992 and 1997, 50-55% of the students who took the exams were placed in the ordinary level, 27-30% in the advanced and 20-23% in the highly advanced level. However, thousands of students weren’t even registered for the exams, meaning that they refrained from taking the ordinary level exam. They were, therefore, excluded from the above statistics. The reasons students did not register to take the exams were mathematics anxiety and fear of failing. Those who took the exams at the ordinary level succeeded in most cases. For example, in 1995 the percentage of students passing the exam was 83.47% and in 1997, 83.35%.

First stage of reform: a new structure for the Ordinary Level

The problem:

The main obstacle to obtaining the Bagrut certificate was the NCEM due to the reasons described above. Israeli society decided that there was a need for change and a need to increase the number of students being tested and eligible to receive the Bagrut certificate.

The reform had social, educational and personal goals:

Social:

• Increasing the number of students eligible for Bagrut, thus enabling them a better start in life.


Educational:

• Promoting mathematical literacy for a greater number of students.

• Adjustment of teaching methods to the needs of a wider student population.

Personal:

• Bagrut certificate is a door opener to higher education.

• Increasing self esteem.

• Success in mathematics may well affect and increase success in other subjects.

The solutions in reform I were:

• Building a curriculum linked to everyday experience, so that students will be able to use a variety of abilities and draw conclusions in and from mathematics.

• Dividing the Ordinary Questionnaire into two parts: A basic questionnaire and a supplementary questionnaire.

Each questionnaire can be tackled individually, according to the student’s pace. The student can study and master mathematics in “small portions”. However, the major difference isn’t in the structure (number) of questionnaires but rather in the substance and approach. The questions in the Basic Questionnaire are designed with the requirements of basic mathematical literacy in mind. The subject matter includes algebra (linear and quadratic equations, graphs, some analytic geometry, and sequences and series), trigonometry, and statistics and probability. The questions leave room for mathematical intuition, common (mathematical) sense and everyday experience, and require relatively simple numerical calculations. Thus, the Basic Questionnaire allows students a broader range of possibilities for expressing their mathematical knowledge. A new “bank of items” (of about 1000 items) was formed and distributed to students and teachers all over the country. The questionnaires were based, with minor changes, on these items. Naturally, using the “bank of items” reduced the level of anxiety.
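The published-bank approach can be pictured in code. This is a toy sketch: the topics, item labels and sampling rule are our assumptions, not the Ministry's actual procedure; only the idea that test papers are drawn, with minor changes, from a bank known to students follows the text:

```python
import random

# A toy item bank keyed by topic (the real bank held about 1000 items).
bank = {
    "algebra": ["algebra-%d" % i for i in range(6)],
    "trigonometry": ["trig-%d" % i for i in range(4)],
    "statistics and probability": ["stats-%d" % i for i in range(4)],
}

def draw_questionnaire(bank, per_topic, seed=None):
    """Assemble a questionnaire by sampling per_topic items from each
    topic of the published bank, so every paper is built from items
    students and teachers have already seen."""
    rng = random.Random(seed)
    return [item for topic, items in bank.items()
            for item in rng.sample(items, per_topic)]

paper = draw_questionnaire(bank, per_topic=2, seed=1)  # 2 items per topic
```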

Herewith are a few examples of questions from the Basic Questionnaire:


Example 1:

Example 2:


Example 3:

(Based on Amit & Fried, 2002).

Between 1997 and 2002 there was a steady increase in the number of examinees taking the ordinary level of the NCEM. The percentage of students taking the Basic Questionnaire rose by approximately 33% and of those continuing to further questionnaires by approximately 21%. The overall number of students eligible to receive the Bagrut certificate (from the whole population) rose by 8%, and this may be related, to a certain extent, to reform I (data received from the Ministry of Education; final statistical analysis to be published in September 2004). Qualitative research shows that students and teachers are willing to make extra efforts and are highly motivated, since there is a better chance to succeed, as one interviewee said: “now there is an intermediate step which helps to reach the highest hurdle”.

Success of Reform I in the NCEM paved the road to the next stage.

Second stage of reform: a modular structure for the NCEM

The problem:

The number of students taking the high-level mathematics examination was unsatisfactory.


The goals of the second stage of the reform were social, educational and personal.

Social:

• To satisfy the demand from the hi-tech and engineering industries for a greater number of math-oriented workers.

• To enable better social mobility as a result of higher grades.

• To enhance the country’s international scientific status.

Educational:

• To expose more students to advanced mathematics.

• To provide students with better skills suitable to the technological era.

• To improve pedagogical resources and improve the curriculum and teaching methods.

Personal:

• Enable students with learning deficiencies to realize their potential and further develop their talents.

• Fulfill personal potential.

• Increase self esteem

• Encourage ambition for excellence

• Reach a higher stage without risking earlier achievements.

The solution in Reform II - a “win-win” situation:

• Creates a continuous learning program from the lower to the upper level.

• Creates a curriculum with a modular structure, with overlapping topics between consecutive levels.

• Creates a modular structure of examinations, with joint questionnaires for two consecutive levels.

• Enables the accumulation of test grades from one level to the next (or former) one.


• Enables three different entry and exit points into and from the program.

Some constraints in implementing the solutions:

• The process was complicated since there was a need to comply with academic and math educators’ constraints.

• The need to assure the study of “important mathematics” at each level

• Preserving the present entrance level into the Bagrut program, in order to encourage low achievers.

• Preserving the highly advanced level - in order to challenge high achievers.

• Creating meaningful “overlapping” between consecutive levels.

[Figure 2 shows seven questionnaires (#1-#7) spanning the three levels, with the overlapping questionnaires shared between consecutive levels:]

#1: Ordinary (Basic) – Entry point I
#2: Ordinary (Supplementary)
#3: Ordinary (Supplementary) / Advanced – Entry point II
#4: Advanced
#5: Advanced / Highly Advanced – Entry point III
#6: Highly Advanced
#7: Highly Advanced

Figure 2. The modular structure

This new approach enables three entrance and three exit points.

Since 2003, seven questionnaires have been composed, three for each level. The ordinary level included questionnaires #1, #2 and #3. The advanced level included questionnaires #3, #4 and #5, and the highly advanced level included questionnaires #5, #6 and #7. In contrast to the previous situation, in which there were distinct questionnaires for each level, in the present case there was overlapping. The third questionnaire in the ordinary level (#3) overlaps the first one in the advanced level. The third questionnaire in the advanced level (#5) overlaps the first questionnaire in the highly advanced level. It should be emphasized that different students have different approaches to the “overlapping” questionnaires, based on their personal mathematical history. For example, students of the ordinary level usually find the first advanced questionnaire (#3) to be difficult, whereas those of the advanced level find the same questionnaire easy.

Although the questionnaires comprise a continuum, a student is not obliged to take all seven tests (starting from the first one). Such an approach would be a waste of time and effort for talented or motivated students, and could have an anti-challenging effect in the long run. The NCEM are structured in such a way that different students can choose among three entry points, again based on their personal history.

An excellent and able student may begin his studies at point III in the highly advanced level and complete his exams with distinction in a short time. Another student, who has motivation and self-confidence but doesn’t excel, may begin his studies at the advanced level starting at point II, progress, and take all the advanced level exams. However, since he has successfully taken questionnaire #5, which is also the first questionnaire in the highly advanced level, he has undertaken the first step and earned 33% of the highly advanced tests. The same can happen between the ordinary and advanced levels. A student starting with the ordinary level questionnaires may succeed in the advanced questionnaire (#3). That may increase his motivation to continue to the advanced level. It should be noted that a student may take his/her exams a number of times after the completion of high school.
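The accumulation rule can be made concrete with a small sketch. The mapping of levels to questionnaire numbers follows Figure 2; the crediting function itself is hypothetical, not the Ministry's actual algorithm:

```python
# Questionnaires required at each level (per the modular structure);
# the overlaps (#3 and #5) belong to two consecutive levels.
LEVELS = {
    "ordinary": {1, 2, 3},
    "advanced": {3, 4, 5},
    "highly_advanced": {5, 6, 7},
}

def credit_toward(level, passed):
    """Fraction of a level's questionnaires already passed, e.g. an
    advanced-level student who has passed #5 already holds one third
    of the highly advanced requirement."""
    required = LEVELS[level]
    return len(required & passed) / len(required)

# An advanced-level student who passed #3, #4 and #5:
passed = {3, 4, 5}
head_start = credit_toward("highly_advanced", passed)  # 1/3, about 33%
```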

The modular structure has psychological and pedagogical implications. Pedagogically, teachers have to comply with changing conditions and develop a “dynamic pedagogy” adjusted to a greater variety of students’ abilities. Teachers need to balance rigorous mathematics against intuition, the concrete against the abstract, etc. Such pedagogy enables students to further develop their skills and helps them to move from one level to the next. Psychologically, success in a high level questionnaire enhances students’ self-confidence. Moreover, if the student receives a good grade at the advanced level he/she is already on the course to success. For example, an ordinary level student succeeding in the joint questionnaire (#3) may wish to try for a higher level of excellence, especially when there is no risk in this venture: the student has already earned his Bagrut certificate.


Teachers’ and students’ support

The curriculum was restructured to fit the above-mentioned changes in the questionnaires. The topics were not changed, but more statistics and probability were added to the advanced and highly advanced levels, and vectors were removed from the advanced level. All teachers and students received the restructured curriculum and a detailed table of the complete curriculum for each level according to the specific questionnaires (see the appendix, and note the joint questionnaires). Moreover, teachers received detailed and explicit explanations regarding the scope and depth of each level. However, experience showed that mere explanations, explicit as they were, did not answer the teachers’ needs. For further clarification, a file of examples was sent to the teachers. Below are some examples of problems from the overlapping areas and questionnaires.

Example 1: Joint questionnaire - ordinary and advanced level

Topic: Differential Calculus


Example 2: Joint questionnaire - advanced and highly advanced level

Topic: Euclidean Geometry

Preliminary implications

There was an awareness that the new structure would create some difficulties: reorganizing classes, moving away from the previous streaming method, transferring to a flexible schedule so that students may move freely from one level to another, and, above all, changing teachers’ perceptions of what students are able to do and how to optimize their abilities.

Reforms succeed within a system as a whole (Amit & Fried, 2002); hence a support system was formed, including seminars, conventions, internet connections and work groups.

The new structure was first implemented in 2003-2004.

Based on previous success, we believe that this reform will be well accepted. Each student will have a way to develop his/her potential to the utmost extent, to aim towards excellence, and to form a basis for better social integration in the future. The two-stage reform in assessment provides a better opportunity for a larger range of students not only to earn the Bagrut certificate but also to


achieve an excellent Bagrut that will open doors to further studies in desired faculties. For these students it will be a win-win situation.

Research will be conducted in the next few years to follow the results, learn the consequences and draw conclusions.

References

Amit, M. & Fried, M. (2002). High-stakes assessments as a tool for promoting mathematical literacy and the democratization of mathematics education. Journal of Mathematical Behavior, 21, 499-514.

Amit, M. & Fried, M. (2002). Curriculum reforms in mathematics education. In L. English (Ed.), Handbook of International Research in Mathematics Education. Mahwah, NJ: Lawrence Erlbaum Associates.

Friedman, I. & Ben-Galim, N. (1996). Final Exams and NCE Exams in Selected Countries: Aims and Performance. Jerusalem: Henrietta Szold National Institute for Behavioral Sciences Research.

Levy, A. (1995). Educational Alternatives to the Bagrut Examination. Jerusalem: Center for the Research of Social Policy in Israel.


Appendix:

The Mathematical Curriculum According to the Modular Model of the NCEM

Ordinary Level

Questionnaire 1 Questionnaire 2 Questionnaire 3

*Equations and systems of equations of the 1st and 2nd degree.

*Word problems

*Algebraic skills: equations and systems of equations with parameters

*Rational expressions

*Word problems

*“Real-life” graphs

*Extension of the idea of exponents

*Problems of growth and decay

*Sequences: arithmetical and geometrical sequences (finite and infinite), problems involving sequences

*Recurrence relations

*Basic concepts in analytic geometry: lines

*Pythagorean theorem

*Behavior of functions

*Analytic geometry: lines, midpoints of segments, circles, ellipses

*Arithmetical sequences

*Linear programming

*Trigonometry: trigonometric functions, problems involving right triangles, rectangles, rhombuses

*Trigonometry: trigonometric functions, applications in plane and solid geometry

*Differential Calculus: polynomial functions, x^k, 1/x, a^x, e^x, derivatives of products, chain rule, applications

*Descriptive statistics (not including the normal dist.)

*Elementary Probability

*Probability

*Normal distribution

*Integral Calculus: primitive functions, calculation of areas


Advanced Level

Questionnaire 3 Questionnaire 4 Questionnaire 5

*Word problems

*Sequences: arithmetical and geometrical sequences (finite and infinite), problems involving sequences

*Recurrence relations

*Equations and systems of equations with parameters

*Solution methods involving substitution (e.g. for biquadratic eqs.)

*Inequalities

*Factoring algebraic expressions

*Exponential expressions and logarithms

*Growth and decay

*Analytic geometry: lines, midpoint of seg-ments, circles, ellipses

*Trigonometry: trigonometric functions, periodicity, simple trig. equations, analysis of polygons divisible into right triangles

*Trigonometry: law of sines and cosines, finding the angles of triangles and other applications in plane geometry

*Differential Calculus: polynomial functions, x^k, 1/x, a^x, e^x, derivatives of products, chain rule, applications

*Differential and Integral Calculus: rational, exponential, logarithmic, and trigonometric functions

*Euclidean Geometry: congruence and similarity; theorems on properties of triangles, quadrilaterals, and the circle

*Integral Calculus: antiderivatives, calculation of areas

*Probability: calculation of probabilities from everyday life or ‘classical’ probability


Highly Advanced Level

Questionnaire 5 Questionnaire 6 Questionnaire 7

*Inequalities

*Factoring algebraic expressions

*Word problems

*Inequalities with absolute values, inequalities with parameters

*Analytic geometry: lines, circles, conic sections, locus problems

*Trigonometry: law of sines and cosines, finding the angles of triangles and other applications in plane geometry

*Sequences: arithmetic and geometric sequences, finite and infinite series, recursive definitions

*Vectors and their use in geometrical calculations and proofs

*Mathematical induction

*Complex numbers

*Euclidean Geometry: congruence and similarity; theorems on properties of triangles, quadrilaterals, and the circle

*Trigonometry: advanced applications in solid geometry (e.g. properties of inscribed figures), trigonometric identities

*Exponential and logarithmic functions from the point of view of algebra and calculus

*Probability: calculation of probabilities from everyday life or ‘classical’ probability

*Investigation of rational, root, and trigonometric functions

*Limit concept

*Implicit differentiation

*Inflexion points, convexity, maximum/minimum problems

*Axiomatic approach to the exponential function


Trends of Assessment in German Mathematics Education Gabriele Kaiser, University of Hamburg

Short remarks on characteristics of the German school system - Federal constitution of Germany with 16 federal states (“Länder”).

Cultural autonomy

A specific feature of the Federal Constitution of Germany is that the federal states (Länder) are given cultural autonomy. This implies that each state is authorised to regulate curricula, determine textbooks and time schedules, and decide on the professional requirements and recruitment of teachers.

In order to co-ordinate the federal states’ policies concerning education, there is a Standing Conference of the Ministers of Education and Cultural Affairs.

Structure of the German school system

The lower secondary level of the German educational system - for pupils from age 10 up to age 16 - is characterised by its tripartite structure: Hauptschule, Realschule, Gymnasium. In contrast, the structure of the upper secondary level - for pupils from age 16 up to age 19 - is a dual one, as it separates general education from vocational education.

Two systems in Germany at upper secondary level

• General education: three years of schooling, university entrance qualification, 31% of the age cohort
  – Basic courses: broad general education; 3 lessons weekly
  – Advanced courses: in-depth introduction to academic study, two subjects selected, 5 lessons weekly

• Vocational education: two or three years, vocational qualification, 69% of the age cohort
  – Full-time: 16% of the age cohort; great variety of school types
  – Part-time: 53% of the age cohort; cooperative apprenticeship at school and workplace


Assessment in mathematics teaching at upper secondary level (general education)

Structure of general education at upper secondary level

At upper secondary level a great variety of subjects is taught, either at basic or advanced course level.

The students select two advanced courses. However, this choice is restricted by the condition that one of the selected courses must be either German, a foreign language, mathematics, or a science subject.

Mathematics remains a core subject and is compulsory until the Abitur (school leaving examination), due to the orientation of the German school system towards general education. Until recently it was possible to drop mathematics in grade 13, an option taken by 10% of the students.

For the advanced courses, mathematics is chosen by 34% of the students, making it the second most frequently chosen advanced subject. But there are considerable differences between the Western and the Eastern federal states and between males and females, as can be seen in the following table.

Western federal states: 32%
Eastern federal states: 41%
Males: 47%
Females: 26%
Total: 34%

National uniform examination standards in Abitur examinations

The national uniform examination standards for the Abitur examinations (the so-called EPA standards) were established at the beginning of the 1970s and were revised in 2002.

These EPA standards determine central aspects of mathematics education at upper secondary level; the most important are:

• mathematics teaching consists of three pillars, namely calculus, linear algebra or analytical geometry, and probability and statistics

• calculus is the most important topic area

• the three pillars are compulsory for Basic and Advanced courses

Furthermore, the EPA standards are oriented towards fundamental ideas such as approximation, modelling, chance, algorithms, functional relations (for details see http://www.kmk.org/doc/beschl/epa_mathematik.pdf).


Differentiation between Basic courses and Advanced courses

Basic courses aim at an elementary or basic education with a propaedeutic function. They form the so-called “basic education” and aim to promote mathematical literacy.

In Basic courses the teaching of fundamental topics is emphasised, as well as exemplary insight into fundamental methods and interdisciplinary aspects of mathematics. Furthermore, Basic courses aim at insight into the differences between everyday knowledge and scientific knowledge. Basic courses in mathematics are mainly attended by students who prefer to specialise in languages and the social sciences.

Advanced courses are characterised by a more systematic and reflective propaedeutic way of teaching and learning. They intend to give broad insight into complex mathematical topics and an independent and reflective mastery of mathematical methods. Advanced courses are mainly attended by students who are planning to study mathematics, the sciences or technical subjects.

In the required standards there are differences between the Basic courses and the Advanced courses concerning the taught degree of structure, complexity, difficulty, openness of problems and the given measures.

Examples from these regulations

Example 1: Oblivimie

2% of the population are “0-people”, i.e. people whose blood carries pathogens of the disease “Oblivimie”. An accelerated test recognises 94% of 0-people, but also classifies 8% of non-0-people as 0-people.

Task at Basic course level

With what probability does the test classify a randomly selected person as a 0-person or as a non-0-person? Explain your approach.

Task at Advanced course level

What is the probability that a positively tested person really is a 0-person? Explain your approach and judge your result.
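A short sketch of the computations both task levels call for; the variable names are ours, while the rates are those stated in the example.

```python
# Rates stated in the example: 2% prevalence, 94% of 0-people recognised,
# 8% of non-0-people wrongly classified as 0-people.
p_zero = 0.02
p_pos_given_zero = 0.94
p_pos_given_non = 0.08

# Basic course level: law of total probability for each test outcome
p_pos = p_zero * p_pos_given_zero + (1 - p_zero) * p_pos_given_non
p_neg = 1 - p_pos

# Advanced course level: Bayes' theorem for P(0-person | positive test)
p_zero_given_pos = p_zero * p_pos_given_zero / p_pos

print(round(p_pos, 4))             # 0.0972
print(round(p_zero_given_pos, 3))  # 0.193
```

The advanced result is the striking one to judge: despite the high recognition rate, fewer than one in five positively tested persons actually is a 0-person, because non-0-people are so much more numerous.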


Example 2: Derivative

Look at the following data:

x 0.0 0.2 0.4 0.6 0.8 1.0

f(x) 3.7 3.5 3.5 3.9 4.0 3.9

Task at Basic course level

The listed data are values of a differentiable function.

Determine an estimated value for f’(0.6).

Task at Advanced course level

Determine various estimated values for f’(0.6) and explain the assumptions needed. Compare and judge the methods you have chosen.

Which method do you prefer? Describe that method in general by means of a formula.
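The difference-quotient estimates the advanced task asks for can be read straight off the table; a sketch, with the central quotient f’(x) ≈ (f(x+h) − f(x−h)) / 2h being the formula usually preferred:

```python
# Tabulated values from the example
x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
fx = [3.7, 3.5, 3.5, 3.9, 4.0, 3.9]

h = 0.2            # spacing of the table
i = x.index(0.6)   # position of the point of interest

forward = (fx[i + 1] - fx[i]) / h            # (4.0 - 3.9) / 0.2 ≈ 0.5
backward = (fx[i] - fx[i - 1]) / h           # (3.9 - 3.5) / 0.2 ≈ 2.0
central = (fx[i + 1] - fx[i - 1]) / (2 * h)  # (4.0 - 3.5) / 0.4 ≈ 1.25
```

The spread between the three estimates (0.5, 2.0 and 1.25) is itself instructive: it shows how strongly the judgement depends on the assumed behaviour of f between the tabulated points.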

Example 3 for Written Examination at Basic course level: Deficient equipment

A company investigates whether it can offer its clients a subsequent guarantee period for a piece of equipment after the expiration of the legal guarantee period.

Due to the company’s experience the following assumptions can be made:

1. Defects in each of the three components T1, T2 and T3 of the equipment occur independently of each other, each with a probability of 0.2.

2. A component that has been repaired does not fail again during the time period concerned.

3. The cost of materials for the repair is 50 € for T1, 40 € for T2 and 10 € for T3.

a) Calculate the probability of each possible combination of repairs for a piece of equipment.

b) For how many pieces of equipment out of 1000 must at least one repair be expected?


c) Each repair incurs an additional 30 € in labour costs. The company investigates the following three variants of contract agreements.

Variant 1:

No subsequent guarantee (for the piece of equipment) will be offered.

Variant 2:

For the equipment, a subsequent guarantee covering only the cost of materials will be offered. For this, the price of the piece of equipment will be raised by about 20 €.

Variant 3:

For one piece of equipment, a subsequent guarantee covering both the cost of materials and labour costs will be offered. For this full-cost guarantee the price of the piece of equipment will be raised by a further 15 €.

Judge these variants from the company’s and the customers’ point of view.
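A sketch of the computations behind parts (a)-(c), under the stated assumptions (each component fails independently with probability 0.2); the variable names are ours:

```python
from itertools import product

p_fail = 0.2
material = [50, 40, 10]  # material cost for T1, T2, T3

# (a) probability of every repair combination (True = component fails)
outcomes = {}
for fails in product([False, True], repeat=3):
    p = 1.0
    for f in fails:
        p *= p_fail if f else (1 - p_fail)
    outcomes[fails] = p

p_only_t1 = outcomes[(True, False, False)]  # 0.2 * 0.8 * 0.8 = 0.128

# (b) expected number of pieces (out of 1000) needing at least one repair
p_at_least_one = 1 - outcomes[(False, False, False)]  # 1 - 0.8**3 = 0.488
pieces = 1000 * p_at_least_one                        # about 488

# (c) expected costs per piece, as a basis for judging the variants
exp_material = sum(p_fail * c for c in material)  # 0.2 * (50+40+10) = 20 euro
exp_labour = 3 * p_fail * 30                      # 0.6 repairs * 30 euro = 18 euro
```

These expected values give one line of judgement for (c): the expected material cost (20 €) matches the Variant 2 price rise, while the expected labour cost (18 €) exceeds the further 15 € charged under Variant 3.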

Example of a New Mathematics Syllabus – the Case of Hamburg

In 2002-2004 a new curriculum for the upper secondary level was developed in Hamburg, which is now under evaluation (for details of the syllabus see Freie und Hansestadt Hamburg, 2004). In addition, extensive teaching materials are being developed, which shall assist teachers in putting the new syllabus into practice (see Freie und Hansestadt Hamburg, Handreichungen).

The educational goals and principles

The goals of this new mathematics syllabus include enabling students to understand mathematical phenomena in the real world, giving them insight into mathematics as a deductively structured world, and developing problem solving abilities.


Educational principles underlying the new syllabus are:

• orientation towards fundamental ideas and reduction of algorithms

• networking as orientation base

• mathematics as language

• promotion of real world examples and modelling

• consideration of new technologies

• strengthening of autonomous activities

• open questioning

• modular structure, no strict order of content

• start a module with a paradigmatic example

The new syllabus differs from the former one in many ways, especially concerning the newly introduced modular structure of the syllabus, the strong promotion of real world and modelling examples together with the emphasis on new technologies.

Modules for Pre-courses in Year 11

The syllabus consists of the following modules for pre-courses in year 11, which have to be taken by all students and which are not differentiated according to mathematical abilities.

Module 1 - From data to functions

• Systematising of the known function classes, connection with descriptive statistics

Module 2 - From average to local rate of change

• Introduction of derivative

• Usage of physical motions as introductory example

One of the following four modules is compulsory:

Module 3 - Graphs

• Shortest and, respectively, longest paths in a network plan; the travelling salesman problem


Module 4 – Iteration

• Discrete description of growth processes, e.g. of the German population

Module 5 - Linear optimisation

• Simplex method

Module 6 – Geometry

• Introduction into analytical geometry

Modules for Basic and Advanced courses (Year 12-13)

In years 12 and 13 the students select mathematics either at basic course level or at advanced course level. The topics of the modules are the same for the two kinds of courses, but there are differences between them from various perspectives: on the one hand, epistemological aspects are emphasised within the modules, such as mathematical reflections on the concept of limit or on proofs. On the other hand, the examples are characterised by a higher level of complexity.

Module 1 - From rate of change to accumulated change (live stock, inventory, reserve, population, variety)

• Introduction of integral

• Connection to rate of change

Module 2 - Matrices and vectors as data store

• Description of discrete population models – Leslie matrix

• Systems of linear equations

Module 3 - From share prices to random walk

• Prognosis of share prices using random walk models

• Concept of chance and probability

Module 4 – Rate of change and accumulated change

• Periodical processes such as tides, biorhythm

• Growth processes such as radioactive decay

• Fundamental theorem of differential and integral calculus


Module 5 - Analytical geometry

• Usage of navigation system GPS

Module 6 - Applied problems of probability and statistics

• Hypothesis test

• Theorem of Bayes
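The Leslie-matrix idea mentioned under Module 2 above can be illustrated with a minimal sketch. The age-class structure and all rates below are invented for illustration; a real model would use empirical fecundity and survival rates.

```python
# Hypothetical Leslie-type model: a population in three age classes,
# projected one time step ahead by a matrix-vector product written out
# by hand (n(t+1) = L * n(t)). All rates are invented.

fecundity = [0.0, 1.2, 0.8]  # offspring per individual in each age class
survival = [0.6, 0.4]        # share surviving into the next age class

def leslie_step(pop):
    """One projection step of the Leslie model."""
    newborn = sum(f * n for f, n in zip(fecundity, pop))
    return [newborn, survival[0] * pop[0], survival[1] * pop[1]]

pop = [100.0, 50.0, 20.0]
print(leslie_step(pop))  # [76.0, 60.0, 20.0]
```

Writing the step as an explicit matrix-vector product also motivates the systems-of-linear-equations bullet: asking which population vector stays unchanged leads directly to a linear system.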

Exemplar of a module

From rate of change to accumulated change (module 1)

The following example may serve as a paradigmatic example for introducing the concept of the integral.

A hot-air balloon is in the air for an extended time. On board is an instrument for measuring the balloon’s speed. During its journey the wind direction shifts twice, which we describe as “negative” speed. If we plot the speed against the elapsed time in a coordinate system, we get the following diagram, which shows the balloon’s journey from the start until its landing:


Questions

• Which data are not available but needed for interpreting the graph? Make suitable assumptions and try to describe the balloon’s journey using them.

• What is the total distance travelled?

• What is the balloon’s distance from the starting point when landing?

• Describe an approach that allows one to infer the travelled distance from the information about the development of the balloon’s speed.
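The accumulation idea behind the last question can be sketched numerically. Since the speed diagram is not reproduced here, the readings below are purely hypothetical stand-ins; the point is the Riemann-sum construction of the integral.

```python
# Hypothetical readings: speed in km/h sampled every 0.25 h,
# negative values meaning the wind has shifted direction.
t_step = 0.25
speeds = [0, 8, 12, 10, -6, -4, 5, 9, 4, 0]

# Accumulate speed x time rectangles (a Riemann sum):
# the signed sum gives the distance from the starting point,
# the sum of absolute values gives the total travelled distance.
displacement = sum(v * t_step for v in speeds)
distance = sum(abs(v) * t_step for v in speeds)
```

Refining the time step makes both sums converge to integrals of v(t) and |v(t)|, which is exactly the transition from rate of change to accumulated change that the module aims at.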

Further prospects

In the future there will be strong changes in the educational landscape caused by new standards, which are now under development. These standards for core subjects such as mathematics will be compulsory for mathematics education at primary and lower secondary level. They are oriented towards fundamental ideas, different levels of qualification, and a central core of mathematical content differentiated for the different types of school (for details see http://kmk.org/schul/Bildungsstandards/bildungsstandards.htm). In addition, regular national tests oriented towards these standards will be established at primary and secondary levels. At present the development of several hundred items is under way; these will be trialled within the field-tests for PISA 2006 and probably TIMSS 2007. These items shall lead to proposals for centralised tests for all federal states as well as for special tests for each federal state.

Furthermore, school leaving examinations (especially the Abitur) will be centralised in most of the federal states, e.g. in Hamburg in 2006. This centralisation will take place within each federal state, and it is not expected that students will take the same examinations all over Germany.

References

Einheitliche Prüfungsanforderungen in der Abiturprüfung. Beschluss der Kultusministerkonferenz vom 24.5.2002. http://www.kmk.org/doc/beschl/epa_mathematik.pdf

Freie und Hansestadt Hamburg, Behörde für Bildung und Sport (2004). Rahmenplan Mathematik. Bildungsplan Gymnasiale Oberstufe. http://lbs.hh.schule.de/bildungsplaene/GyO/MATHE_GyO.pdf


Freie und Hansestadt Hamburg, Behörde für Bildung und Sport. Handreichungen für das Fach Mathematik. www.mint-hamburg.de/handreichungen/handreichungen-gyO.html

Vereinbarung über Bildungsstandards im Fach Mathematik für den Mittleren Schulabschluss. Beschluss der Kultusministerkonferenz vom 4.12.2003. http://www.kmk.org/schul/Bildungsstandards/Mathematik_MSA_BS_04-12-2003.pdf

Vereinbarung über Bildungsstandards im Fach Mathematik für den Primarbereich (Jahrgangsstufe 4) vom 15.10.2004 http://www.kmk.org/schul/Bildungsstandards/Grundschule_Mathematik_BS_307kmk.pdf


National Testing On Line: How Far Can We Go? Steven Bakker and Gerben van Lent, ETS Europe

What we would like to see…

A colleague once said to me: “My idea of the ultimate test in French as a foreign language would be to take the student to a Paris terrace as soon as he was ready for it, see if he was able to order a bottle of wine and then, for the highest score, see if he was able to share that bottle with the young lady at the next table.” The test would indeed be ‘perfect’, but rather inefficient. We all have high hopes of the solutions that IT applications can provide to such dilemmas. Computers should finally enable us to deliver national exams on demand, creating virtual real-life situations for candidates, who will be allowed to demonstrate their real skills rather than their book learning. For students who are used to playing adventure games over the internet, it would only be natural if similar techniques were used for the delivery of national exams. If we could gaze into a crystal ball and catch a glimpse of the exams of the future, we would not be surprised - in fact we would rather expect - to see students at the centers of secure virtual private networks connected to test banks, marking machines, data analysers and reporting engines. Results would be available in real time, of course…

But how far can we go…

The question, of course, is: “How far can we go before we start losing key aspects of the quality demands that we set for exams as we know them now?” Clearly, this presentation assumes that exams will survive the change of times and that certain characteristics of national exams will not alter, such as the idea that the reliability of the results is guaranteed by a party that has no interest in the outcomes of individual students as such. Within that context, we will explore some IT applications that have already been put in place to facilitate each of the five key phases of any national testing effort:


• Test development

• Test administration

• Marking

• Processing and analyzing scores, standard setting, awarding

• Reporting

The examples discussed here are taken from the US and the Netherlands, but similar developments are taking place in many other countries, such as the UK. The presentation will conclude by highlighting some of the main challenges along the road to a future of exams that, thanks to IT applications, are more valid, more reliable, more objective, more flexible, more efficient, less labour-intensive and still acceptable. At the end of the presentation it would be interesting if the audience could share some of their experiences.

Test production

The time when item banks were physical boxes with items typed on tagged cards lies far behind us. Word processors have long since changed the process of test development. Most of the larger test producers nowadays run sophisticated systems for producing, revising and maintaining items, and assembling tests according to pre-set specifications. ETS developed for its own purposes the so-called Test Creation System (TCS), which comprises several software components integrated within a customized Lotus Notes application. TCS extends ETS capabilities beyond the banking of classified items to the facilitation and management of item writing and review processes, test assembly and layout, ongoing inventory analyses, and the monitoring of item and process performance indicators. The system is designed to support efficient work processes, including ‘cloning’ of items. To enhance the quality of the test creation process, ETS introduced the Evidence Centered Design approach, which includes a prototype software tool that links the purpose of the assessment to proficiencies, evidence of these proficiencies, and tasks that can provide such evidence. CITO, another large test producer, based in the Netherlands, uses a system consisting of two modules, for item construction (ItCon) and for storage and selection (THPro). Both systems are institute-specific, and neither was developed or meant for use by third parties. Commercially available products (e.g. Questionmark Perception, Examsoft) usually lack the functionality that these larger institutions require.
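The core idea of assembling a test to pre-set specifications can be sketched in a few lines. This is a deliberately simplified illustration of the general technique, not the actual design of TCS, ItCon or THPro; the item fields and the blueprint are invented.

```python
# Toy item bank: each item is tagged with classification metadata
# (fields invented for illustration).
item_bank = [
    {"id": "A1", "topic": "algebra", "difficulty": "easy"},
    {"id": "A2", "topic": "algebra", "difficulty": "hard"},
    {"id": "G1", "topic": "geometry", "difficulty": "easy"},
    {"id": "G2", "topic": "geometry", "difficulty": "hard"},
    {"id": "C1", "topic": "calculus", "difficulty": "hard"},
]

# Blueprint: how many items each (topic, difficulty) cell must contribute.
blueprint = {("algebra", "easy"): 1, ("geometry", "hard"): 1, ("calculus", "hard"): 1}

def assemble(bank, spec):
    """Greedily pick items until every blueprint cell is filled."""
    test, needed = [], dict(spec)
    for item in bank:
        key = (item["topic"], item["difficulty"])
        if needed.get(key, 0) > 0:
            test.append(item["id"])
            needed[key] -= 1
    return test

print(assemble(item_bank, blueprint))  # ['A1', 'G2', 'C1']
```

Production systems add many more constraints (exposure control, enemy items, statistical targets), which turns assembly into a genuine optimisation problem rather than the greedy pass shown here.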


Test delivery

If anything, it is for test delivery that expectations of IT solutions are high, and yet the moment when you can take a secure test at home, any time, any day, is still rather far away. Web-based registration is common, and offering exams on a CD-ROM rather than as a strictly paper-and-pencil test has been extensively piloted in the Netherlands during the last three years. This approach has created much better opportunities for assessing skills in a real-life context. For this year’s administration of the national exams, the use of computers is required for a fair number of papers. A good example is the award-winning 16+ French exam, which is fully delivered on-screen (after installation on each stand-alone PC from a CD-ROM). Students are taken along a story line and given reading, listening and writing assignments. The listening assignments include video clips, and the writing assignments are framed in a computer context such as chatting and e-mail messages.

While the integration of computer use in exams, or even fully on-screen exams, seems set to become common in the near future, there is still a great deal of uncertainty about the feasibility of on-line exams delivered through the web from a central server. Insufficient external infrastructure, in the sense of not being able to accommodate the logging-in of tens of thousands of students to one site at the same time, is one thing, but a main stumbling block at the moment is the inadequate computer facilities and system management in schools. Professional test centers as currently run by Prometric (Baltimore, US) or Lamark (Amsterdam, the Netherlands) have solved these problems, as well as providing secure and fraud-free delivery. Currently, of the 740,000 TOEFL tests administered annually, 83% are delivered as computer-adaptive tests through one of the 550 Prometric test centers around the world. These centers, however, cannot handle large numbers of candidates at the same time, and tend to be cost-inefficient if not used frequently. Once we have established the practice of using not the same test but the same measurement for every student as the requirement for a fair and acceptable national examination, the combination of professionally run test centers and advanced test generation technology may provide a way to truly on-line, flexible and customised national exams.

Marking

The scoring of student answers to multiple-choice questions by machine was probably one of the first IT applications in large-scale test technology, and it remains an important one. A logical next step, but vastly more complicated to achieve for obvious reasons, is the scoring of answers given to open-ended questions. In contrast to what one might expect at first, it is the scoring of essays that has been the first implementation of this idea. E-rater™, the software engine


behind the ETS-developed Criterion℠ program, now achieves 95% agreement between machine and human scores on a 6-point holistic scale and is used on a large scale for marking essays in high-stakes exams such as the Graduate Management Admission Test (GMAT). Detailed feedback on style, word usage, and the organisation and development of the essay is another feature offered by Criterion for formative use. A next step forward is the ETS c-rater, a software tool that will score constructed-response answers to questions designed to measure understanding of content materials. A prototype has been tested on 4th and 8th grade math questions during the NAEP Technology Based Assessment project. Outcomes showed 87% agreement between c-rater and a human marker, compared to 92% agreement between two human markers.
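As an illustration of how such agreement percentages are computed, here is a simple sketch for two raters on a 6-point holistic scale. The text does not specify whether the quoted figures count exact matches only or exact-plus-adjacent matches, so both are shown; the score lists are invented.

```python
def agreement(scores_a, scores_b, tolerance=0):
    """Share of essays where the two scores differ by at most `tolerance`."""
    matches = sum(abs(a - b) <= tolerance for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

# Invented example scores on a 6-point scale
human = [4, 5, 3, 6, 2, 4, 5, 3, 4, 5]
machine = [4, 5, 4, 6, 2, 4, 4, 3, 4, 5]

print(agreement(human, machine))               # exact agreement: 0.8
print(agreement(human, machine, tolerance=1))  # exact-plus-adjacent: 1.0
```

Comparing machine-human agreement against human-human agreement, as in the c-rater figures above, is the standard way of judging whether automated scoring is "good enough".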

Another important IT application in the marking process is software for organising the process itself and enhancing its efficiency and quality. For these purposes, ETS developed the Online Scoring Network™ (OSN), which is now being used in the US for scoring vast quantities of student scripts produced in large-scale high-stakes assessment efforts, such as the California High School Exit Exams. Student scripts typed on a computer, or scanned hand-written copies, are saved as easily manageable packets that are then electronically sent to markers, who first receive on-line training and then score the responses on-screen. Apart from gains in efficiency, this approach is also believed to add to the objectivity of the marking. The software facilitates real-time monitoring of the process and checks whether the marker is on track, consistent and in line with standards. At the same time, as the markers work, they receive support through easy access to scoring rubrics and specimen answers.

Analyzing

The tremendous increase in computer availability and processing power has had a large impact on approaches to test and item analysis. Over the last decades, the use of processing-intensive methods based on item-response theory has become common in professional testing and has changed our views on proficiency concepts. Sophisticated analysis of a student's responses while the student is taking the test has paved the way for the development of computer-adaptive testing and opened a world in which tests only come into existence the moment they are delivered.
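As a rough illustration of the machinery behind computer-adaptive testing (a sketch only: the model shown is the simple one-parameter Rasch model, and all item difficulties are invented, not taken from any actual test):

```python
import math

def p_correct(ability, difficulty):
    """Rasch model: probability that a student with the given ability
    (in logits) answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def pick_next_item(ability, difficulties, answered):
    """A naive adaptive rule: choose the unanswered item whose
    difficulty is closest to the current ability estimate."""
    candidates = [i for i in range(len(difficulties)) if i not in answered]
    return min(candidates, key=lambda i: abs(difficulties[i] - ability))

# Hypothetical five-item bank, difficulties on the same logit scale.
bank = [-1.5, -0.5, 0.0, 0.7, 1.8]
print(round(p_correct(0.0, 0.0), 2))   # 0.5: ability equals difficulty
print(pick_next_item(0.2, bank, {2}))  # index 3 (difficulty 0.7)
```

Because the next item depends on the responses so far, no two students need see the same sequence, which is how a test can "come into existence the moment it is delivered".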

Reporting

Using the internet for reporting test results has not only changed the time span within which results become available to stakeholders, but also the detail and practical value of the results. A characteristic example may be found in the way the results of the US National Assessment of Educational


Progress (NAEP) campaigns are made available. Any US citizen (or anybody with access to the Web) is now just two mouse clicks away from thousands of NAEP questions, complete with student performance data, scoring guides and sample student answers. Further clicks give access to detailed state profiles and even to a complete data tool that allows one to make performance comparisons between different groups, learn how background variables influence outcomes, and run tests for statistical significance of these differences in outcomes. (http://nces.ed.gov/nationsreportcard)

So where are we going?

From what we have outlined above, it is clear that IT has already pervaded and changed all aspects of large-scale, high-stakes testing. Many sophisticated applications have been developed for every key aspect, and most of the larger professional test publishers are working on integrating these. A bottleneck here is usually the integration of institutional test development modules into commercially available test delivery platforms. Dutch experiments exploring the feasibility of computer use in the Dutch national exams indicate that, at the moment, there are probably no adequate software programs commercially available for full computer-based delivery of exams, i.e. on-screen presentation of the test, saving student responses in a data file, and marking using the computer. An even bigger problem, however, is using these delivery platforms for large numbers of candidates at the same time while maintaining security. While the adequacy of the infrastructure outside testing centers (limited capacity of the cable network and servers) is often mentioned as an important source of uncertainty, the same Dutch investigations indicate that, at the moment, it is the infrastructure within the schools that would not allow for a fully-fledged computer-based delivery of exams, whether over the web or from software installed on a center-based server.

At the same time, though, the results showed that schools taking part in the experiments could handle computer-based administration of exams, even when large numbers of students took the same test at the same time. Students did not report any problems with exams delivered in this way as compared to paper-based administrations. On the contrary, students found the use of computers during exams a normal thing and rather liked the way the tests were presented.

As technology becomes more readily available and daily life continues to be changed by computers, further steps towards truly on-line testing for national exams seem inevitable. It just doesn't make sense, for instance, to


ignore the fact that communicative use of a foreign language includes writing e-mails and notes in chat sessions rather than writing a letter. While the first efforts at computerizing aspects of large-scale tests involved replicating paper-based routines on computer, IT applications over the last few years have marked fundamental changes of practice in large-scale testing. What hasn't changed is the belief that demonstration of knowledge and skills against set criteria and standards is a necessary step in granting admission to educational facilities, jobs and social positions. What may change is the belief in the necessity of the same test taken at the same time for national exams as we know them now, at least if IT applications can provide ways of administering the same measurement to candidates throughout the year while attaining the same or better reliability, validity, efficiency and acceptability. It is difficult to predict, for one country let alone on a global scale, how far away we are from this stage in the evolution of national exams, but considering the speed of developments in the IT sector, which is revolutionizing so many aspects of our economy, it may well be not that far ahead.


The Swedish National Course Tests in Mathematics

Jan-Olof Lindström, Umeå University

Background

The reformed school

In line with the trend of decentralizing the running of the public sector, the Swedish school system has been reformed and more responsibility has been given to the local school authorities and teachers. Since 1994 a new system has been in force, and for the upper secondary school it means:

• Governing through established goals and objectives and through an accompanying evaluation of outcomes.

• Organisation of the teaching in programmes and courses.

• Grading students according to standards at the four levels: Not passed (IG), Pass (G), Pass with distinction (VG) and Pass with special distinction (MVG).

• Providing national course tests (NCT) that are available at the end of term for each particular course in Mathematics as well as in English and Swedish.

The previous system, in force since 1967, included the so-called "Central Tests", which were instruments used for the norm-referenced grading of groups of students. The standardized tests were provided by the National Agency for Education (NAE). The teacher marked the tests, reported the results for a random sample of students, and could then set up a distribution of grades on the test for his or her class on the centrally established five-point scale. The final grades in the subject that the teacher gave to the students in that class should then correspond to the measures of location that had been acquired as a result of the test.

The new system of grading is based on criteria for the different levels of attainment and would accordingly seem better served by criterion-referenced testing. This


meant that test constructors and teachers had to do some rethinking about the development of the standardized tests and the grading of the students.

The assignment

The NAE commissioned departments at four different universities to participate in the development of a system of criterion-referenced national tests. The Department of Educational Measurement (DEM) at Umeå University was assigned the task of developing tests in mathematics for the upper secondary school.

A new agreement has been in force since March 2000. There the idea that a second part of the national test should be carried out as class work has been abandoned. Instead, the whole test should now be taken on one occasion under regular conditions for written tests. However, the test should include a specific, more extensive task that is to be scored analytically according to given aspects, much in the same way as before. The test can also include different units, such as one part taken without an electronic calculator. An oral part should also be included in some of the course tests over a certain time span.

According to the current rules and regulations, the teachers carry the chief responsibility for assessing a student's achievement and assigning the final grade for the course. If available, the result of a national test should be one of the pieces of information on which the teacher relies when issuing a student's grade for a particular course. Until the autumn term of 2000, participation in an NCT was voluntary and decided by the school. Since then it has been compulsory for the school to arrange NCTs in the highest mathematics course in a particular programme. In addition, the test for the A-course is compulsory in all programmes. The result of the NCT should still be seen as just one of the factors that the teacher has to take into consideration when grading the student. The NCT should provide opportunities to show the skills and knowledge that correspond to the assessment levels of G (Pass), VG (Pass with distinction) and MVG (Pass with special distinction) in the parts of the knowledge domain that the test (according to the objectives specified in the information material attached to the test) is intended to measure.

In summary, the NCT should serve both the purpose of measuring the knowledge gained by the student and that of providing information about what the student still has to learn. According to the contract, the NCT should therefore be an instrument for


• establishing a basis for valid inferences about the student’s mastery of the course according to the national goals and objectives

• producing a test grade – the teacher gives the course grade on the grounds of the student’s total performance in the particular course

• guaranteeing equity in the education received in different schools all over the country

• stimulating discussion among students and teachers about the aim, content and methods of mathematics education

The framework

The curricular documents

The process of constructing an NCT emanates from an analysis of the steering documents. The curriculum1 and the programme goals2 are generally formulated. There it is stated that the school shall strive to ensure that all pupils can use their knowledge as a tool to

• formulate and test assumptions as well as to solve problems

• reflect on what they have experienced

• critically examine and evaluate statements and relationships etc.

Important key words and aspects of knowledge in the goal documents are:

• appreciation of mathematics,

• knowledge of concepts and procedures,

• ability to formulate and test assumptions,

• ability to solve practical problems and tasks,

• modelling, argumentation and reasoning,

• representation and communication.

1 The curriculum for the non-compulsory school system (Lpf 94). Skolverket (1994).
2 There is a special goal document for each of the seventeen national programmes in the Swedish upper secondary school.


In the process of constructing and combining different questions into a test, our ambition has been to include items so that most of the different aspects are covered. Problem solving, authenticity3 and openness have become a set of prestige words in this line of work.

More specific objectives are laid out in the syllabi4, although the descriptions of the different content areas and topics are very brief. It is therefore left to the individual teacher or the test constructor to decide how much weight a particular topic should carry in each course. So far this does not appear to have been much of a problem: the interpretation of the summary prescriptions of the content in each of the five A–E courses seems to be consensual within the Swedish upper secondary school teacher community.

The format of the test

The NCTs in Mathematics have mainly been given in the form of a written exam throughout the years since the new system was introduced. A voluntary oral task or a practical task has occasionally been included in the examinations.

The items or tasks of the test have been a mixture of different types prompting answers of different forms

• Selected response (multiple choice, selected completion, etc.)5

• Short answer (where just the result of a numerical calculation or a word or a short sentence is required as an answer)

• Long answer (extended answer questions or constructed-response tasks requiring explanations of how a final answer was reached)

• Essay answers (e.g. a performance assessment task with an invitation to write a paragraph or two on what the student knows about a topic).

The experiences from the first years with NCTs, which included parts taken on different occasions, led to the decision by the NAE that the test should now be given on one occasion. The test shall include an extensive task that is to

3 See Palm (2002).
4 See the appendix on The syllabus for Mathematics C (100 points) according to the regulations valid from 2001.
5 This type of question has been given only on a few occasions in recent years.


be scored analytically in accordance with given aspects6 much in the same way as the open solution task.

Materials for oral examinations are offered as a voluntary optional part of the NCTs for courses C and D. The material is available on the Internet.

Resources

Since the specifications state that the test results should be used for evaluative purposes, the directions have been tightened. A special collection of formulas is provided with the tests, and the allowed types of calculators are specified. The tests in all courses are given in two parts: one to be taken without a calculator and another where a calculator of a designated type is allowed or even a prerequisite.

A blueprint

In a first stage of constructing a blueprint for the course test the syllabus is studied and the weight for each topic in the particular course is decided.

Table 1. Topics according to the syllabus 2000 for the different courses in Mathematics. The proportions given in the table are results of the initial discussions in the consultative groups.

  Topic                     Course A  Course B  Course C  Course D  Course E
                            (100p)    (50p)     (100p)    (100p)    (50p)
  A  Algebra                  20%       50%       20%                 35%
  D  Differential Calculus                        50%       20%       25%
  D  Integral Calculus                                      40%       40%
  F  Functions                10%                 30%
  G  Geometry                 25%       10%                  5%
  R  Arithmetic               35%       25%
  S  Probability                        -7
  S  Statistics               10%       10%
  T  Trigonometry                       -8                  40%

6 See the appendix on Assessment schedule based on three aspects.
7 Included in Statistics.
8 Included in Geometry.


The performance levels constitute a second dimension. The questions in the test could be aimed at indicating knowledge corresponding to different grades from G up to MVG. A particular course test could be designed to give a higher reliability in the decision either Not passed/Pass (IG/G) or Pass/Pass with distinction (G/VG).

Apart from giving a general idea of what should be assessed by the course test the analysis of the syllabus and the grading criteria results in a preliminary idea of how the items should be distributed. This could be described as a “blueprint” of the test, in other words a plan for how the assessment objectives should be covered by the test. Figure 1 shows how such a blueprint for a national test for course C in mathematics is recorded in a form that is being used during the construction of a test. This is done in the row labelled “Plan”. The summary in the row labelled “Outcome” is compared to the “Plan” after completing a suggested combination of tasks (items).
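The Plan/Outcome comparison described above can be sketched as a simple tally; the task names, topics and point values below are invented for illustration only:

```python
# Hypothetical tally of a draft test against its blueprint: each task
# contributes g- and/or vg-points classified by topic.
tasks = [
    {"name": "1a", "topic": "Calculus", "g": 1, "vg": 0},
    {"name": "5",  "topic": "Calculus", "g": 1, "vg": 2},
    {"name": "9",  "topic": "Algebra",  "g": 2, "vg": 0},
    {"name": "15", "topic": "Algebra",  "g": 3, "vg": 4},
]

def outcome_by_level(tasks):
    """Percentage of all points at each grade level, as in the
    'Outcome' row that is compared with the 'Plan' row."""
    g = sum(t["g"] for t in tasks)
    vg = sum(t["vg"] for t in tasks)
    total = g + vg
    return {"G": round(100 * g / total), "VG": round(100 * vg / total)}

def outcome_by_topic(tasks):
    """Percentage of all points per topic, for checking topic weights."""
    total = sum(t["g"] + t["vg"] for t in tasks)
    shares = {}
    for t in tasks:
        shares[t["topic"]] = shares.get(t["topic"], 0) + t["g"] + t["vg"]
    return {k: round(100 * v / total) for k, v in shares.items()}

print(outcome_by_level(tasks))   # {'G': 54, 'VG': 46}
print(outcome_by_topic(tasks))   # {'Calculus': 31, 'Algebra': 69}
```

If the resulting percentages drift too far from the planned distribution, tasks are exchanged and the tally repeated.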

[Figure 1 is a spreadsheet-like record sheet for the national course test in Mathematics course C, spring 2002, and is not reproduced here. Each row is a task (1a "Differentiate" through 15 "Vases") with columns for maximum points, estimated time in minutes, resources, response type, openness, context, cognitive level, cognitive content, pre-testing, fairness, language/logic, marking instructions, the grading criteria addressed, g-points, estimated points for a borderline G student, and vg-points; the topics for course C are general arithmetic, algebra, and differential and integral calculus. The "Plan" row gives the intended totals (240 minutes; 50% G, 40% VG, 10% MVG) and the "Outcome" rows summarise the actual totals for the proposed combination of tasks.]

Figure 1. An example of a form used during the process of constructing a test for Mathematics course C.


Rubrics

Each test is provided with a specific marking scheme where points are awarded for different parts or aspects considered to be required of an acceptable answer (for example 1 g-point and 1 vg-point9, and so on).

The rubrics or marking schemes are usually specific in the sense that they are applicable only to expected work on a particular item. The work of a student on a certain item can be judged holistically, but the marking scheme usually implies an analysis of the solution with respect to

• shown understanding of the problem and ability to choose a suitable strategy to solve it,

• shown skill in performing a suitable procedure and needed algorithms with proper conclusions, and finally

• shown ability to communicate his/her work in an understandable and consistent form.

Each test includes one specific extended task. The performance of the student on that particular item is judged from the following three aspects:

• Choice of method and accomplishment.

• Mathematical reasoning.

• Presentation and mathematical language.

Considering these aspects as three not necessarily parallel components facilitates the scoring in three scales, each beginning with g-points and ending with a mixture of g- and vg-points. As these extended tasks are also meant to elicit work by the students that can indicate that the highest grade has been earned (tasks marked with ¤), these records can be of special use when deciding whether a particular piece of work is an indication of that grade. The score is, however, also included in the total sums of g- and vg-points and in most cases just contributes to the decisions on the lower grades based on the cut-scores.

It has not been shown that this complex way of awarding points to a student's solution increases the inter-rater reliability, although it might. The major reason for awarding points in this way has been the belief that it increases the validity of the test in that it more clearly points

9 See next paragraph on Grading


out what competencies are meant to be measured by the particular task, thereby serving an educative purpose vis-à-vis the teachers by exemplifying how the goals and criteria can be interpreted.

Grading

The items are classified according to the possible inferences about the applicability of criteria at the different grade levels that can be drawn from the student's work on a certain item. The points attributed to a certain item are consequently, in part or in whole, either g-points (Pass) and/or vg-points (Pass with distinction). This means that the result of the test can be reported on separate g- and vg-scales. The possibility of utilising a two-dimensional score in setting the cut-scores was not fully employed until the tests in the spring of 2000; until then the grade VG was given for a total score that surpassed the chosen grade boundary for the total sum. It should, however, not be possible to get a VG just by collecting many points from tasks that are only intended to prompt work at G-level. Therefore this cut-score was chosen so that it was impossible to reach without also acquiring a reasonable number of points on items connected to the criteria for VG (see figure 2).

[Figure 2 is a diagram, not reproduced here, of the score plane with g-points on one axis and vg-points on the other. Three cut-scores (the minimum total score for the grade G, the minimum total score for the grade VG, and the minimum vg-score for the grade VG) divide the plane, bounded by the maximum total score, into regions corresponding to the grades IG, G, VG and MVG.]

Figure 2. The two-dimensional scoring model with cut-scores.


The tests in the autumn of 2000 were the first where the score was openly presented as g-points and vg-points. In all the following NCTs the cut-scores have been two-dimensional and expressed in terms of minimum numbers of g-points and vg-points, i.e. the cut-score for VG is expressed as a minimum total score of g-points and vg-points, including at least a certain minimum number of vg-points.

The requirements for the grade MVG have been expressed in terms of a higher minimum number of vg-points and fulfilment of minimum performance requirements according to the rubrics for certain items specially marked (¤) in the test.

The test construction process

The construction of all the tests is the joint responsibility of the project group at the DEM; however, for each particular course there is one appointed responsible test constructor. From the start of the department's work with the NCTs, the policy has been to involve teachers from schools and universities all over the country. Today between 20 and 30 teachers are engaged in the different stages of the setting of a particular test. These teachers are involved as item writers, they participate in carrying out the pre-tests, and they take part in the meetings of the consultative and standard-setting groups. The construction process is illustrated in figure 3.

[Figure 3 is a flow chart, not reproduced here. Over five terms, item writers, the consultative group and the standard-setting group interact, in a series of meetings (M), with the project group at the Department of Educational Measurement (DEM) in Umeå; items are pre-tested in the upper secondary schools of Sweden, the finished NCT is sent via the printers to the schools, and data and reports flow back to the DEM and the National Agency of Education (NAE).]

Figure 3. The process of constructing one NCT covers two years and involves teachers who individually or in group meetings (M) interact with the responsible test constructor at the DEM.


Item writing

Teachers are invited to contribute to the test construction by taking on an assignment to write five to ten items with a corresponding marking scheme. The specification of the task includes the type of expected response and valued characteristics, like a degree of openness and authenticity, as well as a list of general competencies10 that should be tested, such as

• modelling

• reasoning

• communication and

• concept understanding.

Meetings with the consultative group

A consultative group consists of one member of the DEM (the responsible test constructor for the particular course) and four to six external members. A few of the people involved are university teachers, but the majority are experienced upper secondary school teachers or teachers in adult education. They come from all parts of the country. Their task is to review and revise items (for pre-testing or for a particular test) and to participate in the construction of the test and rubrics in accordance with the specifications agreed upon. Finally, their task is to validate the finished test and to give an opinion on the levels in the standard setting.

In the process, every item is judged from different aspects:

• Linguistic appropriateness - Is the language close enough to the students’ language and correct enough not to stir up teachers’ feelings?

• Logical consistency - Does the task contain ambiguities?

• Pre-testing - Is there a need for the item to be pre-tested or, alternatively, is the available data adequate?

10 Characteristics of competencies are described and exemplified by a commented choice of items in a booklet Uppgiftstyper i Fysik och matematik med exempel och kommentarer issued by DEM 1999.


• Rubrics - Do the general marking directions enable a consistent marking of the task? - Is there a need to specifically point out which shown knowledge should be rewarded? - Do the rubrics allow for the possibility of using different strategies in responding to the task? - How do the rubrics correspond to the ones used earlier in similar cases?

• Fairness - Is the question fair from a gender/cultural/ethnic/language point of view?

• Validity: Does the task - unduly presuppose a specific method, level of detail, depth etc. in the response? - elicit student responses that can be reliably and accurately assessed? - provide grounds for inferences about whether the particular objectives have been reached? - have a likely impact on teaching and learning in a way that corresponds to the spirit of the curriculum and syllabi? - provide a desirable picture of the subject?

During the test construction process the items are classified according to the goals and criteria in a matrix like the one in figure 1. In the table, the content goals are labelled A1, A2 … for Algebra, C1, C2 … for Calculus, etc., while the grading criteria are labelled G-, VG- or MVG- and numbered within each grade. In a matrix included in the information booklet accompanying each test, the final result of the compilation is presented (see the appendix).

To facilitate the decision regarding which grade level a particular item should belong to, the item can, for example, be classified with respect to cognitive aspects and openness. Categories11 like facts and procedures, problem solving, and reasoning are used. The degree of openness is a classification based on the extent to which the expected response can differ and still be adequate. A closed item prompts just one method or procedure, with just one possible conclusion considered to be an adequate answer. To produce an answer to an open task, a variety of assumptions, restrictions, data and/or methods can be selected.

11 The classification of Items for the Test Bank in Mathematics. http://www.umu.se/edmeas/provbank/pbma/pb-info-matte.html

Such a classification can also be used to assess the validity of the test together with evidence that the test is constructed in concordance with the content goals in the syllabus and the grading criteria.

Pre-testing

Schools from all over the country are selected at random and invited to participate in the pre-testing of a limited number of items. A lesson of 40–60 minutes is required, and the size of the participating groups can be anything between 30 and 200 students. The aim of the exercise is to acquire a number of solutions to the particular set of questions as well as background data, including programme, sex and preliminary grade for the particular course, for each student in the group. The students' solutions are sent to the DEM, where they are marked and the data is analysed.

The pre-testing is meant to indicate answers to the following questions:

• Do the students understand the task in general, and do they understand what is asked for?

• Does the task mean that any group of students is treated unfairly?

• Does the recorded p-value support the classification of the item based on the grading criteria?

• What are the affective reactions to the item from teachers and students?

• How long does it take the students who could be expected to manage the task to produce an answer?

• Are the rubrics clear enough to produce a consistent marking of the task?
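The p-value check listed above rests on the classical item-difficulty index: the proportion of the maximum attainable score that the pre-test group earned. A minimal sketch with invented data:

```python
def p_value(scores, max_points):
    """Classical item difficulty: mean score divided by the maximum
    attainable score (1.0 means every student earned full marks)."""
    return sum(scores) / (len(scores) * max_points)

# Invented pre-test data: ten students' scores on a 2-point item.
scores = [2, 2, 1, 0, 2, 1, 1, 0, 2, 2]
print(p_value(scores, 2))  # 0.65
```

A very high p-value on an item classified at VG level, for instance, would cast doubt on that classification, which is exactly the kind of check the pre-testing is meant to support.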

Standard setting

Throughout the process of test construction the consultative group is concerned with the problem of standard setting12. As part of the assessment of validity and reliability the proposed test is scrutinized from the point of view that students at different performance levels should be given the opportunity to show their

12 See Nyström, P. & Lindström, J. O. (1999)


knowledge and skill and thereby be given a reasonable number of points. During the compilation of the test, a record is kept of which points borderline students are likely to collect when taking the test, i.e. students whom the members of the consultative group, from their experience and according to their interpretation of the grading criteria, consider as just barely G- or VG-qualified. The column "Estimated points for G-" (G minus) in the record sheet in figure 1 is meant for that purpose. This means that the consultative group already has a certain set of cut-scores in mind when finalizing its work.

However, in the standard setting a special procedure involving an external committee is used. Ten to twelve teachers are invited to form a standard setting committee. They should be experienced in teaching the particular course and therefore well acquainted with the syllabus and the grading criteria, and they ought not to be teaching a group of students who are going to sit the actual test. The meetings (one or two per course test) are arranged in different places around the country and are chaired by appointed local leaders connected to teacher training departments at different universities. The leaders are trained to act according to a procedure described in a detailed instruction13 developed at DEM.

The committee members receive the exam paper one week before the scheduled meeting. The meeting itself is planned to take some four hours. At the beginning of the meeting each member renders an individual estimate of the Pass (G) and Pass with distinction (VG) requirements. These estimates are supposed to be made after considering which particular tasks the borderline students should manage. After a discussion about differences in opinion about the difficulty of different items, an Angoff procedure14 is carried out, usually just once for each of the two levels G and VG. It can also be repeated with an intervening discussion.

For dichotomously scored tasks (0/1) a simple way of describing the crucial part of the procedure is to say that the committee members should estimate the probability that a student who is just barely qualified to pass (or pass with distinction) would answer a particular question correctly. The procedure is adjusted to give separate estimates for the g- and vg-points of an

13 Instruktion för kravgränsmöte. An instruction issued by DEM to be strictly followed by the appointed leader of the meeting of the local standard setting committee.
14 A modification of the Angoff Method on cut-off scores and judgment consensus. See Angoff (1971).


item. With regard to a polytomously scored item the committee members must give an estimate of the average proportion of the maximum g- and vg-score for the particular item. The estimates for all the items in the test are recorded individually and collected to be sent to DEM. From these we can then derive a two-dimensional cut score.
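The crucial aggregation step of the Angoff procedure can be sketched as follows. The item names, point values and judge estimates are invented for illustration; the actual procedure at DEM also involves the discussion rounds and the final decision meeting described in the text:

```python
from statistics import mean

# Hypothetical judge estimates for the g-level: judges_g[j][item] is judge j's
# estimate of the probability (dichotomous item) or average proportion of the
# item's maximum g-score (polytomous item) for a just-barely-G student.
g_points = {"1": 1, "2": 2, "3": 1}   # invented g-point maxima per item
judges_g = [
    {"1": 0.8, "2": 0.6, "3": 0.4},
    {"1": 0.7, "2": 0.5, "3": 0.6},
]

def cut_score(judges, max_points):
    """Sum over items of (mean judge estimate) x (item maximum): the expected
    score of the borderline student, used as the proposed cut score."""
    return sum(mean(judge[item] for judge in judges) * points
               for item, points in max_points.items())

# 0.75*1 + 0.55*2 + 0.5*1 = 2.35; the vg cut score is computed the same
# way from the vg-point estimates, giving the two-dimensional cut score.
print(round(cut_score(judges_g, g_points), 2))
```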

In a meeting with the test constructors in the project group at the department, the final decision about the cut scores is made. All records are placed on the table, and the decision is taken after an analysis of the data, again considering the grading criteria and goals of the particular course.

Reporting

The results from the national tests are reported with the purpose of supplying the National Agency for Education and the institutions that produce the tests with data needed for various purposes. The NAE is supposed to provide feedback to the schools and information about the outcome of the school system. The constructors of the tests need data for further developmental work to improve the tests, and for research in the field of assessment of knowledge. For these purposes results are collected from all the students taking the test in one sixth of the schools around the country reported to participate in the particular test.

The ambition during the first years of the project was to report psychometric and statistical data for the tests as well as to discuss the results based on the qualitative characteristics of individual items. The average proportion of points (p-value), as well as the parameters a and b of a two-parameter logistic regression function15, were determined for different groupings of all test takers. A few reports have been published comprising qualitative and quantitative observations connected to the specific tasks presented16. However, the present regulation that the tests are subject to secrecy for a period of ten years confines the possibility to give feedback to the schools in a way that is likely to catch the teachers' interest. During the last years the feedback has been provided in a report17 common for the NCTs in all subjects. Due to

15 P_i(θ) = (1 + e^(-a_i(θ - b_i)))^(-1), which gives an estimate of the probability P that a student with the total score θ will answer the item i correctly; a is a measure of the discrimination power of the item and b is an estimate of its difficulty. See Hambleton & Swaminathan (1985).
16 http://www.umu.se/edmeas/publikationer/index.html
17 See for example Skolverket (2002), Gymnasieskolans kursprov.


the secrecy, and a policy applied by the NAE to present all subjects uniformly, the reports have largely been confined to presenting the resulting distribution of the proportions of students across grades for the different programmes and other subdivisions of the students.
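The two-parameter logistic function defined in footnote 15 can be sketched as a small function. This is only an illustration of the formula itself, not of the estimation procedure used at the department; the ability value and parameters below are invented:

```python
import math

def p_correct(theta, a, b):
    """Two-parameter logistic function (footnote 15): the estimated
    probability that a student with total score theta answers the item
    correctly; a is the discrimination and b the difficulty of the item."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the probability is 0.5 regardless of the discrimination a;
# a larger a makes the curve steeper around b.
print(p_correct(theta=30, a=0.2, b=30))   # 0.5
print(p_correct(theta=40, a=0.2, b=30) > p_correct(theta=40, a=0.1, b=30))
```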

References

Airasian, P. A. (2001). Classroom Assessment: Concepts & Applications. New York: McGraw-Hill.

Angoff, W. H. (1971). Scales, norms and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education.

Berk, R. A. (1986). A consumer's guide to setting performance standards on criterion-referenced tests. Review of Educational Research, 56, 137-172.

Hambleton, R. K. & Swaminathan, H. (1985). Item Response Theory: Principles and Applications. Boston: Kluwer-Nijhoff Publishing.

Nyström, P. & Lindström, J. O. (1999). An Evaluation of a Method for Standard Setting in the National Mathematics Tests. Proposal for paper session at the 8th EARLI Conference 1999. Enheten för pedagogiska mätningar, Umeå universitet.

Palm, T. (2002). The Realism of Mathematical School Tasks: Features and Consequences. Doctoral Thesis No 24. Department of Mathematics, Umeå University.

Palm, T., Bergqvist, E., Eriksson, I., Hellström, T., & Häggström, C.-M. (2004). En tolkning av målen med den svenska gymnasiematematiken och tolkningens konsekvenser för uppgiftskonstruktion. PM Nr 199. Umeå: Department of Educational Measurement, Umeå University.

Skolverket (1994). The 1994 Curriculum for the non-compulsory school system (PDF file). http://www.skolverket.se/english/index.shtml

Skolverket (2000). Programme manual Gy2000:19 (PDF file). http://www.skolverket.se/english/index.shtml


Appendix

National test in mathematics course C spring 2002 (Syllabus 2000)

[Facsimile of the test paper, pages 83-99 of the original, not reproduced here.]

Syllabus and grading criteria for mathematics

(Excerpts from the curriculum documents)

Mathematics

Aim of the subject

Upper secondary education in mathematics builds further on knowledge corresponding to that attained by pupils in the compulsory school by broadening and deepening the subject. The subject aims at providing a knowledge of mathematics for studies in the chosen study orientation and for further studies. The subject should provide the ability to communicate in the language and symbols of mathematics, which are similar throughout the world.

The subject also aims at pupils being able to analyse, critically assess and solve problems in order to be able to independently determine their views on issues important both for themselves and society, covering areas such as ethics and the environment.

The subject aims at pupils experiencing delight in developing their mathematical creativity, and the ability to solve problems, as well as experience something of the beauty and logic of mathematics.

Goals to aim for

The school in its teaching of mathematics should aim to ensure that pupils:

1. develop confidence in their own ability to learn more mathematics, to think in mathematical terms, and use mathematics in different situations,

2. develop their ability to interpret, explain and use the language of mathematics, its symbols, methods, concepts and forms of expression,

3. develop their ability to interpret a problem situation and formulate this in mathematical terms and symbols, as well as choose methods and aids in order to solve problems,

4. develop their ability to follow and reason mathematically, as well as present their thoughts orally and in writing,

5. develop their ability with the help of mathematics to solve on their own and in groups problems of importance in their chosen study orientation, as well as interpret and evaluate solutions in relation to the original problem,

6. develop their ability to reflect over their experiences of concepts and methods in mathematics and their own mathematical activities,


7. develop their ability to work in a project and in group discussions work with the development of concepts, as well as formulate and give their reasons for using different methods for solving problems,

8. develop their ability to design, fine-tune and use mathematical models, as well as critically assess the conditions, opportunities and limitations of different models,

9. deepen their insight into how mathematics has been influenced by people from many different cultures, and how mathematics has developed and continues to develop,

10. develop their knowledge of how mathematics is used in information technology, as well as how information technology can be used for solving problems in order to observe mathematical relationships, and to investigate mathematical models.

MA1203 – Mathematics C

100 points. Established 2000-07, SKOLFS 2000:5.

Objectives

Goals that pupils should have attained on completion of the course

Pupils should:

G1. be able to formulate, analyse and solve mathematical problems of importance for applications and their selected study orientations…

G2. with an in-depth knowledge of concepts and methods learned in earlier courses

R2. be able to interpret and use logarithms and powers with real exponents, and be able to apply these in solving problems

A7. be able to set up, simplify and use polynomial expressions, as well as describe and use the properties of some polynomial functions and power functions

A8. be able to set up, simplify and use rational expressions as well as solve polynomial equations of higher powers through factorisation

R3. be able to use mathematical models of different kinds, including those which build on the sum of a geometric progression

A6. be familiar with how computers and graphic calculators can be used as aids when studying mathematical models in different application areas

D1. be able to explain, illustrate and use the concept of changing coefficients and derivatives for a function, as well as use these to describe the qualities of a function and its graphs

D2. be able to identify the rules of derivation for some basic power functions, sums of functions, as well as simple exponential functions, and in connection with this describe why and how the number e is introduced


D3. be able to draw conclusions from a function’s derivatives, and estimate the value of the derivative when the function is given by means of a graph

D4. be able to use the relationship between a function’s graph and its derivatives in different application contexts with and without aids for drawing graphs.

Grading criteria

Criteria for Pass

G1. Pupils use appropriate mathematical concepts, methods, models and procedures to formulate and solve problems in one step.

G2. Pupils carry out mathematical reasoning, both orally and in writing.

G3. Pupils use mathematical terms, symbols and conventions, as well as carry out calculations in such a way that it is possible to follow, understand and examine the thinking expressed.

G4. Pupils differentiate between guesses and assumptions from given facts, as well as deductions and proof.

Criteria for Pass with distinction

V1. Pupils use appropriate mathematical concepts, methods, models and procedures to formulate and solve different types of problems.

V2. Pupils participate in and carry out mathematical reasoning, both orally and in writing.

V3. Pupils provide mathematical interpretations of situations and events, as well as carry out and present their work with logical reasoning, both orally and in writing.

V4. Pupils use mathematical terms, symbols and conventions, as well as carry out calculations in such a way that it is easy to follow, understand and examine the thinking they express, both orally and in writing.

V5. Pupils demonstrate accuracy concerning calculations and solutions to different kinds of problems, and use their knowledge from different fields of mathematics.

V6. Pupils give examples of how mathematics has developed and been used throughout history, and the importance it has in our time in a number of different areas.


Criteria for Pass with special distinction

M1. Pupils formulate and develop problems, choose general methods and models for problem solving, as well as demonstrate clear thinking in correct mathematical language.

M2. Pupils analyse and interpret the results from different kinds of mathematical reasoning and problem solving.

M3. Pupils participate in mathematical discussions and provide mathematical proof, both orally and in writing.

M4. Pupils evaluate and compare different methods, draw conclusions from different types of mathematical problems and solutions, as well as assess the reasonableness and validity of their conclusions.

M5. Pupils describe some of the influences of mathematics in the past and present on the development of our working and societal life, as well as on our culture.


Description of items in the C-course test in mathematics spring 2002 in relation to the topics and assessment criteria in the syllabus.

[Table not reproduced: for each item (1a-15) it lists the g- and vg-points and marks with crosses the topics (arithmetic, algebra, general, differential and integral calculus) and the assessment criteria for Pass, Pass with distinction and Pass with special distinction that the item relates to. Items 6b and 15 carry a ¤ marker. In total the test comprises 23 g-points and 19 vg-points.]


General assessment schedule based on three aspects for extended response items

Choice of method and accomplishment

The assessment concerns: to what extent the student can interpret a problem situation and solve different types of problems, and how completely and how well the student uses methods and models that are relevant to solve the problem.

Qualitative levels, from lower to higher:

• The student solves problems or parts of problems of simple routine character and thereby shows a basic understanding of concepts, methods and procedures.

• The student solves different types of problems and thereby shows a good understanding of concepts, methods and procedures, and carries out reliable calculations. The student provides mathematical interpretations of situations and uses mathematical models.

• The student can develop tasks and use relevant procedures. The student can use general methods and models for problem solving.


Mathematical reasoning

The assessment concerns: the occurrence and quality of evaluation, analysis, reflection, proofs and other forms of mathematical reasoning.

Qualitative levels, from lower to higher:

• The student follows and understands mathematical reasoning, both orally and in writing. The student draws conclusions from examination of one or a few cases.

• The student carries out logical mathematical reasoning, both orally and in writing. The student draws conclusions from a larger number of and/or well chosen cases.

• The student takes in others' arguments and, from these, presents her/his own mathematically based ideas and explanations. The student evaluates and compares different methods, and analyses and interprets the results from different kinds of mathematical problem solving. The student draws conclusions from general reasoning and can accomplish deductions and mathematical proofs.

Presentation and mathematical language

The assessment concerns: how clear, distinct and complete the work of the student is, and how well the student uses mathematical terms, symbols and conventions.

Qualitative levels, from lower to higher:

• The presentation is possible to follow in parts, even if the mathematical language is poor and occasionally incorrect.

• The presentation is easy to follow and understand. The mathematical language is acceptable.

• The presentation is well structured, complete and clear. The mathematical language is correct and appropriate.


Test Quality Considerations Torulf Palm, Umeå University

Introduction

Generally, in test development and test evaluation different criteria for the quality of a test may be used. Furthermore, different emphasis may be put on the chosen criteria depending on the interpretation of the main purposes of the test. Such interpretations may, for example, be based on the interpretation of the assignment given to the test developers.

In the work with the Swedish national tests and test banks at the Department of Educational Measurement, Umeå University, considerations about aspects of test quality are made both in the test development work before the test is carried out and in the evaluation after the test has been carried through. In these processes curriculum and syllabus documents, solution rates, actual student solutions and answers to teacher questionnaires are included in the analysis. Below is a brief description of a number of aspects of test quality that are focused on in these processes, together with a brief description of how these criteria are addressed. The criteria are similar to those proposed by, for example, Linn, Baker & Dunbar (1991) and Linn & Herman (1997).

Quality criteria

• Alignment

For each task in a test, the content and aspects of knowledge intended to be tested are founded on an interpretation of the syllabus and grading criteria. All of the tasks are classified into categories corresponding to this interpretation. This means that each task is categorised concerning, for example, which grading criteria and which objective in the syllabus it corresponds to. The distribution of the tasks over these categories is then compared to a distribution, decided beforehand, which is judged to be consistent with the interpretation of the steering documents. The categorisation is used in the test development process, and some of the categorisations are attached to the tests in the "information to teachers" material.


The categories include, for example, the category Cognitive content, which includes the possibilities of classifying a task as Modelling, Reasoning, Communication, Concept understanding, or Algorithms. For an example of a classification, see the classification of the C-course test in Figure 1 in the contribution by Jan-Olof Lindström in this volume.
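The comparison between the observed distribution of tasks over categories and the distribution decided beforehand can be sketched as follows. The task labels and target shares are invented for illustration only:

```python
from collections import Counter

# Hypothetical cognitive-content classification of the tasks in a draft test,
# using the category names from the text above.
tasks = ["Algorithms", "Modelling", "Reasoning", "Algorithms",
         "Concept understanding", "Communication", "Algorithms",
         "Concept understanding"]

# An invented target distribution, standing in for the one decided beforehand.
target = {"Modelling": 0.15, "Reasoning": 0.15, "Communication": 0.10,
          "Concept understanding": 0.25, "Algorithms": 0.35}

counts = Counter(tasks)
n = len(tasks)
for category, goal_share in target.items():
    observed = counts[category] / n
    flag = " <- check" if abs(observed - goal_share) > 0.10 else ""
    print(f"{category:22s} observed {observed:.2f}  target {goal_share:.2f}{flag}")
```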

• Reliability

The developmental process includes a qualitative approach to reliability, which focuses on the awareness of the influences of tasks and raters on the reliability of a test. Pretests are used to collect information about different student solutions, the students’ and teachers’ experiences with the tasks, and the consistency of the marking schemes to facilitate interrater agreement. In addition, some after-test evaluation studies of the reliability of the tests have been conducted.

• Fairness

The groups involved in the test development process discuss, and make choices accordingly about, the risk that some groups of students may be disadvantaged, due to their group belonging, when working with some of the tasks. Decisions about tasks involve considerations about possible undue differences in performance, primarily due to gender, native language, and study programme. Suspicions of such undue differences may arise from earlier experiences of test results or classroom teaching. In the evaluations after a test is carried out, solution rates are regularly analysed for gender differences and differences between students from different study programmes. Sometimes differences in performance due to native language are also analysed.
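The routine analysis of solution rates for subgroup differences can be sketched as follows. The rates and group labels are hypothetical, and a flagged gap would only prompt closer qualitative review, not by itself establish unfairness:

```python
# Hypothetical post-test solution rates (proportion of maximum score) per
# item and subgroup; the real analysis works from individual student data.
rates = {
    ("item 1", "girls"): 0.62, ("item 1", "boys"): 0.58,
    ("item 2", "girls"): 0.41, ("item 2", "boys"): 0.55,
}

def gap(rates, item, group_a, group_b):
    """Difference in solution rate between two subgroups on one item; a large
    absolute gap only flags the item for closer qualitative review."""
    return rates[(item, group_a)] - rates[(item, group_b)]

for item in ("item 1", "item 2"):
    g = gap(rates, item, "girls", "boys")
    print(f"{item}: girls - boys = {g:+.2f}")
```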

• Language/logic

The language used in the tasks is discussed by both the consultative group involved in the test development process and by the people at the department. Opinions are also collected from the teachers involved in the pretesting.

• Comparability

A group at the department decides the cut scores for the different test grades. They do so based on information from different sources. The standard setting groups make their judgements based on the grading criteria, both with a holistic view of the test and with a repeated Angoff method. The consultative groups, which also give suggestions for the cut scores, also


have information from the pretests, concerning both solution rates and solution methods. After the tests have been carried out, the teachers' opinions about the cut scores are collected from the teacher questionnaire.

• Cost and efficiency

The cost, in money and workload, and possible positive and negative effects of different features of the tests are continuously discussed at the department and in the teacher groups involved in the development of the tests. This may, for example, concern the teachers' workload due to marking and the test developers' workload due to the requirements of developing complex tasks and marking schemes. The positive effects should be worth the cost. Information about the teachers' opinions is collected from the teacher questionnaire and discussed.

• Meaningfulness

A concern for the different test development groups is that the tests should be such that the teachers and the students experience the tests as meaningful. This means, for example, that the teachers consider the information gathered from the test helpful in their grading process, and that the students experience the knowledge and situations that are tested as worthwhile educational outcomes. The latter may be enhanced by truly authentic tasks describing out-of-school situations experienced as meaningful and familiar to the students.

• Consequences

In the test development process, decisions about, for example, response format and aspects of knowledge tested are influenced by judgements of likely influences on teaching and learning. The interpretation and implementation of attainment targets, content, and grading criteria are of special concern. After the tests have been carried out, information about the teachers' beliefs about the influences of the tests is collected from the teacher questionnaire.

References

Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20, 15-21.

Linn, R. L., & Herman, J. L. (1997). Standards-led assessment: Technical and policy issues in measuring school and student progress (CSE Technical Report No. 426). Los Angeles: UCLA National Center for Research on Evaluation, Standards, and Student Testing (CRESST).


The Test Bank in Biology - Assessment of Experiments and Field Investigations Gunnel Grelsson & Christina Ottander, Umeå University

Background

The development of an IT-based test bank in biology in Sweden was initiated in late 1999 by the Department of Educational Measurement on commission from the National Agency for Education. The test bank was opened in November 2001 and is available to biology teachers in upper secondary school.

The test bank provides items for test construction covering the course Biology A at upper secondary school. Traditionally, teachers in Sweden have had the freedom to interpret the grading criteria in biology, as in many other subjects, on their own. Within a class or a school this system does not raise any problem. However, as there have never been any centralized tests supporting the teachers' grading in biology, as there are in e.g. mathematics, no concordant test culture in biology has developed. The comparability of grading between different regions, and between different schools within regions, in Sweden has therefore been questioned. The intention in this test bank project, consequently, is to provide pre-tested items of high quality, with scoring directions, that will support the teachers in their assessments. Hopefully the effect will be grading with higher comparability.

The development of the test bank

At the beginning of the project, biology tests from different schools all over the country were collected and analysed with respect to content, item format and process (cf. Appendix 1). The results pointed out a need for the development of items that promote higher order thinking. These results are supported by Bol & Strage (1996), who found that although teachers wanted their students to develop higher order study skills, the teachers' assessment practice did not support this goal. Thus, the decision was to start with extended response items, which were judged to be the most needed by the teachers in an initial phase of the test bank.


Item construction

The item construction is made in cooperation with teachers. Five groups, each consisting of up to four biology teachers from schools and universities (Departments of Teacher Education) all over the country, are involved in the construction process.

All material constructed within one group is reviewed by one of the other groups according to a predetermined schedule. Revision is thereafter made with respect to the comments from the reviewers and from the member of the Department of Educational Measurement who is responsible for the test bank. The next step is field-testing, followed by analysis of results18, student answers and comments from the teachers who have carried out the pre-tests. The items that show sufficient quality in a certain test are then accepted (maybe after a second minor revision) for the test bank. Rejected items might be subject to a major revision and a new field-test. During the whole process, consideration is given to the curriculum and syllabus. All items are categorized (see Appendix 1) from different perspectives. While constructing a test, a compilation of the categorization of the included items gives an overview of the characteristics of the test. Furthermore, the categorization is the basis for the search tool in the test bank.
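Since the categorization is the basis for the search tool, the core of such a tool can be sketched as a simple filter over categorized item records. The item identifiers and category fields below are invented for illustration:

```python
# Hypothetical item records: each accepted item carries its categorization,
# here reduced to a few invented fields, which a teacher can filter on.
items = [
    {"id": "B014", "goal": "ecology",  "format": "extended response"},
    {"id": "B027", "goal": "genetics", "format": "extended response"},
    {"id": "B031", "goal": "ecology",  "format": "short answer"},
]

def search(items, **criteria):
    """Return the items whose categorization matches every given criterion."""
    return [item for item in items
            if all(item.get(field) == value for field, value in criteria.items())]

for item in search(items, goal="ecology", format="extended response"):
    print(item["id"])   # prints only B014
```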

Goals in biology

A revised syllabus for upper secondary school was established by the government in 2000. In Table 1 the general goals and grading criteria for Biology A are presented. An ideal test bank should include many items that completely cover the goals and the grading criteria. As mentioned, there is no tradition of centralized tests in biology in Sweden, and no previously used material has been available as a foundation for the test bank in biology.

18 p-values for the groups of students who have the qualifications for each goal.
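In classical test analysis such p-values are item difficulty indices: a group's mean score on an item divided by the maximum attainable score. A minimal sketch under that assumption (the scores and group labels below are invented for illustration, not field-test data):

```python
def p_value(scores, max_score):
    """Item difficulty index: group mean score as a fraction of the maximum score."""
    return sum(scores) / (len(scores) * max_score)

# Invented scores (0-2 points) on one item, split by the goal for which
# each group of students has the qualifications.
groups = {
    "goal 3": [2, 1, 0, 2, 1],
    "goal 4": [1, 1, 2, 2, 2],
}
difficulty = {group: p_value(scores, max_score=2) for group, scores in groups.items()}
# The closer a group's p-value is to 1, the easier the item is for that group.
```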


Table 1. Goals and grading criteria for Biology A in Swedish upper secondary school, SKOLFS 2000:19.

Assessment concerns: Use of concepts, models and theories
Goal: Presence and quality of evaluation and analysis, and ability to use biological concepts, models and theories on environmental phenomena. (Be)

Pass (G):
♦ The student uses introduced biological concepts and models to describe biological phenomena and relationships.

Pass with distinction (VG):
♦ The student uses introduced biological concepts, models and theories to explain biological phenomena and relationships and applies them to everyday life situations.
♦ The student elucidates and discusses questions and hypotheses about environmental phenomena on the basis of biological theories and models.

Pass with special distinction (MVG):
♦ The student compares and evaluates the validity of different models and theories.
♦ The student analyses and discusses new questions and hypotheses about environmental phenomena and reflects on their validity on the basis of biological theories and models.
♦ The student integrates knowledge from different areas and relates this knowledge to theories.

Assessment concerns: Scientific way of working
Goals: Investigative task: to what extent the student can design, carry out, interpret and evaluate an investigative task. (Ve1) Different ways of describing nature. (Ve2)

Pass (G):
♦ The student accomplishes investigative tasks according to instructions and evaluates and discusses the results under supervision. (Ve1)
♦ The student distinguishes between scientific and other ways of describing reality. (Ve2)

Pass with distinction (VG):
♦ The student participates in the design and accomplishment of an investigation and interprets the results on the basis of introduced theories and proposed hypotheses. (Ve1)
♦ The student distinguishes between scientific and other ways of describing reality. (Ve2)

Pass with special distinction (MVG):
♦ The student practices a scientific way of working, interprets results and evaluates the validity and reasonableness of conclusions based on introduced theories and proposed hypotheses. (Ve1)
♦ The student identifies differences between scientific and other ways of describing reality. (Ve2)


The items constructed so far are intended for written examinations covering the first goal, "Use of concepts, models and theories" (Be) (Tables 1 and 2). However, none of the skills involved in "Scientific way of working" (Ve1 and Ve2) are covered by this kind of test, mainly because they are usually connected with experiments and field investigations. Both Comber & Keeves (1973) and Bennett & Kennedy (2001) found that the skills needed for carrying out experiments are assessed to only a very limited extent in written examinations.

Table 2. An example of categorization of some items. (For definitions see Appendix 1)

Biology Course A

Item title      Goal  Response  Process  Time   Openness  Perspective  Figure    Facilities  Key words
Buttercup        4       S        Sr     1-5       4          -           -          -       Categories, naming
Growth hormone   8       E        Cr     5-10      4          I           -          -       -
Decomposers a)   3       E        Cr     5-10      4          En       Food web      -       -
Decomposers b)   3       E        P      5-10      2          -        Food web      -       -

                Score: grade and general goal        Mean score
Item title      G/Be  VG/Be  MVG/Be  Total          G     VG    MVG
Buttercup        2      -      -       2            1     1,8    2
Growth hormone   1      2      1       4           1,9    2,7   3,4
Decomposers a)   1      1      -       2           1,3    1,3   1,5
Decomposers b)   1      1      -       2           0,2     1     1

(The score columns for Ve1 and Ve2 contain no entries for these items and are left out here.)

Certainly, the goal that the students should develop the ability categorized as the "Scientific way of working" (Table 1) is as important as the other goals. However, a formalized assessment model for this has not been practised to any great extent in Sweden. Teachers have assessed this skill in a quite informal way, if they have put any emphasis on it at all when grading the students. This paper continues with a discussion of the problems of assessing the more practical skills that this goal covers. The first question is, however: 'What underlies this goal?' An analysis of the goal based on the literature (cf. Tamir et al. 1992 a & b, Lindström et al. 1999, Solano-Flores et al. 1999, IB Diploma programme guide 2001) showed that at least five skills are involved in the goal and are relevant for laboratory and field experiments covering the first sub-goal, "Investigative task" (Ve1), of the main goal "Scientific way of working" in Table 1. These are defined and briefly described in Table 3. This implies that the teacher should take many factors into account when assessing the practical skills of every single student. The situation in the laboratory or in the field would, however, make it difficult for a single teacher to carry out the assessment. Usually there are at least 15-20 students whose work should be assessed at the same time.

Table 3. Analysis of skills in laboratory and field experiments that should be assessed in biology.

1. Design experiments
   Question/problem:
   - hypothesis or prediction
   - define variables
   Plan experiments:
   - studies of literature
   - theories, ideas
   - design procedure and experiment
   - choice of methods

2. Carry out experiments
   - use instructions
   - measure (experiment & control)
   - choice and use of equipment
   - collection and documentation of data
   - maintain order
   - observe safety procedures

3. Interpret results
   - analysis of results
   - interpretation of results
   - analysis of limitations
   - analysis of assumptions

4. Evaluation
   Analysis of:
   - results
   - methods
   - sources of error
   - limitations and assumptions
   Conclusions and evaluations

5. Presentation (report or performance)
   Description of:
   - question/hypothesis
   - methods
   - results
   Discussion
   Conclusions
   Synthesis


A further examination of the skills, however, showed that not all of them need to be assessed while the student is working. The only skill that always has to be assessed during the student's work is the second one, the skill to carry out experiments. Especially the use of equipment, measuring skills and the collection of data must be assessed during the student's work. The design of experiments could be assessed from a written description made by the student; otherwise the whole planning process has to be assessed. Skills 3, 4 and 5 (Table 3) can all be assessed from the student's presentation of the results of the experiment (i.e. the report of the experiment).

A development of tests covering the discussed goals is planned, but this work has to be carried out with care. If the tests entail more work for the teachers, they will place less emphasis on other goals, or not use the tests at all. Neither of these scenarios is desirable. The assessment criteria for tests of skills in carrying out experiments must therefore be as clear and simple as possible. Only such tests will be appropriate to include in the test bank in biology.

References

Bennett, J. & Kennedy, D. (2001). Practical work at the upper high school level: the evaluation of a new model of assessment. International Journal of Science Education, Vol. 23, 97-110.

Bol, L. & Strage, A. (1996). The contradiction between teachers' instructional goals and their assessment practices in high school biology courses. Science Education, Vol. 80, 145-163.

Comber, L.C. & Keeves, J.P. (1973). Science education in nineteen countries. New York: Wiley and Sons.

IB Diploma programme guide. Biology, February 2001.

Lindström, L., Ulriksson, L. & Elsner, C. (1999). Utvärdering av skolan avseende läroplanernas mål (US98). Portföljvärdering av elevers skapande i bild. Stockholm: Skolverket.

SKOLFS 2000:19.

Solano-Flores, G., Jovanovic, J., Shavelson, R.J. & Bachman, M. (1999). On the development and evaluation of a shell for generating science performance assessments. International Journal of Science Education, Vol. 21, 293-315.

Tamir, P., Doran, R.L. & Chye, Y.O. (1992a). Practical skills testing in science. Studies in Educational Evaluation, Vol. 18, 263-275.


Appendix 1

Categorization of biology items

Goals, Biology A

Pupils should

1 be able to plan and carry out field studies and experimental investigations, interpret these, as well as present their work orally and in writing.

2 have a knowledge of Man's relationship to nature from the perspective of the history of ideas.

3 have a knowledge of the structure and dynamics of ecosystems.

4 have a knowledge of the principles for classification of organisms as well as how this is done.

5 have a knowledge of the importance of the behaviour of organisms for survival and reproduction.

6 have a knowledge of the theories of natural science concerning the origins and development of life.

7 have a knowledge of the structure of the gene pool, as well as understand the relationships between this and individual characteristics.

8 have a knowledge of gene technological methods and their applications, as well as be able to discuss the opportunities and risks they pose from an ethical perspective.

Response

Response concerns the type of presentation the item asks for. It refers only to the type of answer, not to the content of the answer.

• Multiple choice

• Short response

• Extended response

• Laboratory experiment

• Essay


• Project

Process

This criterion is based on the process of thinking required to answer the item.

• Simple reproduction – the pupil shows an ability to reproduce such knowledge as facts, simple relationships and principles she/he has learnt.

• Complex reproduction – the pupil shows an ability to apply and combine relationships, and to reproduce reasoning she/he has learnt as well as render it in her/his own words.

• Problem solving – the pupil shows an ability to draw her/his own conclusions by logical reasoning based on given data in situations that are new to the pupil.

Estimated time consumption

An estimate of the time it will take for an average pupil to answer/solve the item.

• < 1 min

• 1-5 min

• 5-10 min

• 10-15 min

• 15-30 min

• >30 min

Openness

1 An open item, formulated so that it can be interpreted in several different ways; the pupil's logical reasoning will lead to a solution and answer depending on that interpretation.

2 An item with an unambiguous question which can give several answers depending on which type of example the pupil chooses as the basis of his/her reasoning.

3 An item with a question which can be answered with different types of reasoning that all lead to the same answer/solution in the end.

4 A closed item, formulated in such a manner that it is clearly evident how the reasoning should be carried out and only one answer is correct.


Perspective

Different perspectives that are illustrated in an item.

• Historical

• Environmental

• International

• Ethical

Figures

• No figures

• Illustrations – do not provide new information but make the item clearer or lead the thoughts along appropriate lines

• Figure or Table – necessary information which has to be interpreted before the item can be answered correctly

Facilities

• No facilities

• Equipment – e.g. field or laboratory facilities

• Literature – e.g. species keys or field guides

• Computer/video – e.g. programs, Internet

Scoring

The assessment of a solution and answer of an item is based on the criteria for the scoring levels pass (G), pass with distinction (VG) and pass with special distinction (MVG).

The assessment of an answer is also classified concerning:

1 Use of concepts, models and theories (Be)

2 Scientific way of working when examining tasks (Ve1)

3 Scientific way of describing nature (Ve2)


Implementing a Creative Competence19 Jesper Boesen, Umeå University

Introduction

Since 1994 there has been a new curriculum and new syllabuses in the Swedish upper secondary school. This reform included not only new policy documents but also a new grading system, a new national test system and a shift from a more centrally directed system to a system with a higher degree of local influence.

Very briefly, the "new" curriculum LPF94 (Swedish National Agency for Education, 2001a, 2001b) represents a shift in the view of knowledge from procedural knowledge to conceptual knowledge20 (i.e., when compared to the old curriculum Lgy 1970). What are believed to be important mathematical competencies has changed: it is not only the content that is covered, but also the kinds of competences that are required, that have come into focus. Some key words that characterize these competences are reasoning, modelling, generalizing and communicating. These may be competences that mathematicians and mathematics educators have always regarded as important features, but which have not earlier been so clearly described in the curricula and the syllabuses (Movitz et al., 2002). One of these competences is of specific interest in this proposal, namely the competence (creative) problem solving, which can be seen as the ability to solve non-routine tasks.

19 This proposal is mainly written to support a discussion held at the SweMaS conference and is in an ongoing writing phase; parts and references might therefore not be complete. 20 This "shift" is to be described further in the thesis.


Hypothesis

There are indications, both from a pilot study (Boesen, 2003) and from related research at the department (Lithner, 2001), of a discrepancy between at least some of the following elements:

A. How the competence creative problem solving is described in the policy documents

B. The concretisation of this competence in the National Tests

C. How this competence is dealt with in the classroom (learning environment)

D. The actual outcome in student performance concerning this competence

It is not claimed that there necessarily are discrepancies between all of these elements; the hypothesis mainly concerns the difference between A and D.

Planned framework

The competence creative problem solving is so far undefined, and a precise definition is a central part of the ongoing work. To give an idea of what is intended to be studied, one could say that it could be described in a Schoenfeld manner (Schoenfeld, 1985). Another reference that captures one important aspect (or component) of this competence is Lithner (2003), who discusses problems and routine tasks and different qualities of reasoning. Yet another reference of special interest concerning how to describe this competence is Niss & Jensen (2002), although their description of the competence might not meet the requirements of an analysing tool. They describe the competence 'Problembehandlingskompetence' (problem handling competency) as:

This competence consists partly in being able to pose, i.e. detect, formulate, delimit and specify, different kinds of mathematical problems, "pure" as well as "applied", "open" as well as "closed", and partly in being able to solve such mathematical problems in fully formulated form, one's own as well as those of others, and, if necessary or desirable, in different ways.

Comment: A (formulated) mathematical problem is a particular kind of mathematical question, namely one for which a mathematical investigation is necessary to answer it. In that sense, questions that can be answered solely by means of (a few) specific routine operations could also fall under the concept "problem". However, such questions, which for the person who is to solve them can be answered by activating routine skills, are not counted as mathematical problems in this connection. The concept "mathematical problem" thereby becomes not absolute, but relative to the person confronted with it. What for one person may be a routine task may for another be a problem, and vice versa. Not every mathematical question poses a mathematical problem. For example, "What does the 0 in 406 mean?" is not a problem that requires a mathematical investigation, but a question about mathematical concept comprehension and use of language. But since many questions do in fact pose a problem, being able to formulate mathematical problems is intimately connected with being able to ask mathematical questions and to have an eye for the types of answers to them, cf. the mathematical thinking competency. But the two competencies do not coincide. Being able to solve a mathematical problem is not part of the thinking competency. Conversely, for example, the thinking competency's distinction between definitions and theorems is not in itself part of the problem handling competency, even though in practice this distinction can be an important prerequisite for that competency. The boundary between the handling of applied mathematical problems and active mathematical model building is fluid. The more it is necessary to take into account specific features of the elements involved in the problem situation, the more one is dealing with model building. Being able to detect and formulate mathematical problems and being able to solve fully formulated mathematical problems are not the same thing. It is quite possible to be able to formulate mathematical problems without being able to solve them. One can even pose problems with only an elementary conceptual apparatus, without any solution being obtainable with that apparatus. Correspondingly, it is possible to be a skilful problem solver without being good at finding and formulating mathematical problems.

(Niss & Jensen, 2002)

(At this moment the rest of the competences that the above quotation refers to are omitted in this version of the proposal; if needed, a systematic description will be added later.)

As mentioned earlier, the development of a precise framework (concerning the competence) is crucial in this project, and the framework is planned to work as an analysis tool when conducting the different studies discussed later in the proposal.

Another framework is needed when setting the whole project in its context, and a first thought goes to the TIMSS curriculum model when it comes to dealing with implementing this competence (Robitaille et al., 1993). The TIMSS curriculum model involves three aspects: the intended curriculum, the implemented curriculum, and the achieved curriculum. "These represent, respectively, the mathematics and science that society intends for students to learn and how the education system should be organized to facilitate this learning; what is actually taught in classrooms, who teaches it, and how it is taught; and, finally, what it is that students have learned, and what they think about these subjects." In the light of this perspective one could see both components A and B as the intended curriculum, C as the implemented curriculum and D as the attained curriculum. (Another interpretation could result in A as intended, B and C as implemented and, finally, D as attained; this depends on one's perspective on the components' different statuses.)

Other terminology that could be of relevance for this project is alignment (Barnes et al., 2000; McGehee & Griffith, 2001) and high stakes (Chapman & Snyder, 2000; McGehee & Griffith, 2001; Stephens et al., 1994), which are concepts for, respectively, the correspondence between the policy documents and the National Tests, and how decisive/crucial the tests are apprehended to be by teachers and students. (Both are assumed to be important features in this project.)

Characteristics of reform

In an article, Ross et al. (2002) describe reform in math education in ten categories. These categories could be used to describe the shift from the old to the new system and could serve as a basis for defining it as a reform. The categories are:21

1. Broader scope (e.g., multiple math strands with increased attention on those less commonly taught, such as probability, rather than an exclusive focus on numeration and operations).

2. All students have access to all forms of mathematics, including teaching complex mathematical ideas to less able students.

3. Student tasks are complex, open-ended problems embedded in real-life contexts; many of these problems do not afford a single solution. In traditional math, students work on routine applications of basic operations in decontextualized, single-solution problems. […]

4. Instruction in reform classes focuses on the construction of mathematical ideas through students' talk rather than transmission through presentation, practice, feedback and remediation.

5. The teacher's role in reform settings is that of co-learner and creator of a mathematical society rather than sole knowledge expert.

6. Mathematical problems are undertaken in reform classes with the aid of manipulatives and with ready access to mathematical tools (calculators and computers), support not present in traditional programs.

7. In reform teaching, the classroom is organized to encourage student-student interaction as a key learning mechanism rather than to discourage it as an off-task distraction.

8. Assessment in the reform class is authentic (i.e., analogous to tasks undertaken by professional mathematicians), integrated with everyday events, and taps a wide variety of abilities, in contrast with the end-of-week and unit tests of near transfer that characterize assessment in traditional programs.

9. The teacher's conception of mathematics in the reform class is that of a dynamic (i.e., changing) discipline rather than a fixed body of knowledge.

10. Teachers in reform settings make the development of student self-confidence in mathematics as important as achievement […].

(Ross et al., 2002)

21 These might have to be modified to meet the Swedish system's requirements.

[My intention is to apply this "framework" to the Swedish policy documents and corresponding parts as a way to describe "the shift", which by some is labelled a paradigm shift; see e.g. the PM from Skolverket, Effektstudien.]

Research on effects of reform

In the same article on reform in mathematics education 1993-2000, Ross et al. (2002) assume that the latest round of reform began with the publication of the Curriculum and evaluation standards (National Council of Teachers of Mathematics Commission on Standards for School Mathematics, 1989). [*This is an assumption which seems to be applicable in the Swedish setting as well.] According to Ross et al., one could identify two paths within the research during the period: one path consisted of studies concerning the effects of reform on student achievement, and the second focused on evidence of non-implementation and barriers to enactment. The first set of studies could be summarized by "[…] students in classrooms characterized by mathematics education reform have higher achievement on achievement measures emphasized by reformers such as problem-solving and conceptual understanding, have more positive attitudes toward the subject, and at least have no less achievement on objectives emphasized by traditional programs such as computational efficiency" (p. 128). Outlining the second path, "a lengthy catalogue" as the authors call it, the most important feature seems to be that teachers must be the 'agents of change'. There are many explanations for why this might not be the case, e.g. the pedagogy is not only different but also harder to learn, teacher beliefs about mathematics, teachers' lack of content knowledge, parental expectations, lack of time to cover the curriculum, etc.

The most powerful method for increasing implementation seems to be in-service training for teachers. Other ways of reducing barriers to implementation include alignment of assessments and implementation through integration with technology, although there is little evidence on how the latter contributes to reform.

Aim and research questions

The overall aim is to investigate, as completely as possible, how this competence has been implemented in the Swedish upper secondary school. The following preliminary questions set the scene for the study as a whole: i) What are the main characteristics? (Does the hypothesis hold? What do the discrepancies comprise?) ii) What are the main reasons? (Why hasn't the implementation worked?) iii) What measures can be taken to improve these conditions? These questions concern the implementation of this competence, not the competence itself. A comment: the focus of this thesis will (probably) be on questions i) and ii); question iii) will rather be of a speculative nature (of course on the basis of the results from i) and ii)).

Perhaps a stronger focus on one of the components A-D will be necessary in order to go deeper into the main question; preliminarily, such a focus would be on component B, the National Tests and their role as an implementation tool for the policy documents.


Planned studies

This section contains the four main objects of study, formerly named elements, and some preliminary ideas of study interest.

A: The policy documents. A first step will be to analyse and compare the documents from 1970 and 1994 and to describe in detail "the transition" from a procedural to a conceptual view of knowledge22. The analysis will depend on the chosen framework, a framework that should be applicable to all four objects of study, A-D. The main focus in this study is to analyse how the competence "creative problem solving" is described in the policy documents, and what the desired outcomes/goals are. Further studies that can be of interest are interviews with the authors of the policy documents, analysis of background material, etc.

B: The National Tests. The National Tests are assumed to be one of a number of "instruments/factors" that influence "the classroom mathematics". Without entering a discussion of the effects of testing, but just referring to it, a number of scholars claim centrally administered high-stakes tests to be important levers for change, e.g. (Barnes et al., 2000; McGehee & Griffith, 2001; Stephens et al., 1994). The tests' role in this perspective will not be further discussed here, but in this proposal the tests are one of the main objects of study, and one aim is to classify the items on the test(s) (some or all; selection?) in order to identify those that require the desired competence. Another relevant question is how well the tests are aligned with the policy documents (McGehee & Griffith, 2001; Pettersson & Kjellström, 1995). [Special focus will probably be devoted to this component, i.e., this part could constitute the centre of attention in the proposal.]

C: In the classroom/learning environment (for a definition, see Lithner (2001), page 26). The main question here is: how is teaching organised/conducted in relation to this competence? Are there tasks in the textbooks that require the competence? So far this part of the study is planned to be of a literature-survey style, except for the textbook analysis. At this moment no classroom studies are planned.

22 (The curriculum from 1970 included the syllabuses, but the curriculum from 1994 separates the curriculum from the syllabuses, and from 2000 there are new syllabuses belonging to the curriculum from 1994.)


D: This part relates to what students have actually attained in relation to the items classified in B. Data from student results on the tests will be analysed (on those items classified as requiring the problem-solving competence). One study will focus on what students actually do when solving these items, regardless of what is required. Videotapes of 8 students working with these tasks are available. (General solution rates will be analysed.) Additional data can (perhaps) be found in both the TIMSS 2003 and PISA studies for comparative studies (this of course requires that the items that relate to this competence are classified by the same framework).

Difficulties and discussion questions

The first question that arises is perhaps: why is this an interesting project at all? Isn't there already enough evidence that reforms in policy documents often result in only new papers and nothing in the classroom? (That rather than policy documents changing practice, practice changes the documents in order to meet more traditional values.)

As I see it, this question is highly relevant, mainly because of the importance of the desired competence. In my view, the problem-solving competence is perhaps the most central competence. Another key issue is that this question hasn't been thoroughly investigated in the Swedish system.

Here follow some questions that can serve as a basis for a discussion:

• Is it a reasonable research project? a) Would it be valuable to get (partial) answers to the research questions? b) Is it, in your opinion, possible to obtain such answers?

• What are the main difficulties in this proposal?

• One of my main difficulties lies in component C. At this stage I plan to conduct as much of this part as possible by reviewing other research, but not much research has been done on the Swedish system. This might lead to the exclusion of this part… or is it too central a part to exclude?

• Suggestions for a framework?

• References?


References
Barnes, M., Clarke, D., & Stephens, M. (2000). Assessment: The engine of systemic curricular reform? Journal of Curriculum Studies, 32(5), 623-650.

Boesen, J. (2003). Teacher beliefs on effects of national tests in mathematics for the Swedish upper secondary school. Unpublished manuscript, Umeå.

Chapman, D. W., & Snyder, C. W., Jr. (2000). Can high stakes national testing improve instruction? Reexamining conventional wisdom. International Journal of Educational Development, 20(6), 457-474.

Lithner, J. (2001). Undergraduate learning difficulties and mathematical reasoning. Roskilde Universitetscenter, Roskilde.

Lithner, J. (2003). A framework for analyzing qualities of mathematical reasoning: Version 2. (No. 3, 2003). Umeå: Department of Mathematics.

McGehee, J. J., & Griffith, L. K. (2001). Large-scale assessments combined with curriculum alignment: Agents of change. Theory into Practice, 40(2), 137-144.

Movitz, L., Emanuelsson, G., Johansson, B., Kilborn, V., & Pettersson, A. (2002). Kunskapsöversikt och bibliografi i matematik (Kunskapsöversikt för Skolverket). Göteborg.

National Council of Teachers of Mathematics Commission on Standards for School Mathematics. (1989). Curriculum and evaluation standards for school mathematics. Reston, Va.: The Council.

Niss, M., & Jensen, T. H. (2002). Kompetencer og matematiklæring (Competencies and mathematical learning). Uddannelsesstyrelsens temahæfteserie nr. 18-2002. Undervisningsministeriet.

Pettersson, A., & Kjellström, K. (1995). The curriculum's view of knowledge transferred to the first national course test in mathematics. Stockholm: PRIM, LHS.

Robitaille, D. F., Schmidt, W. H., Raizen, S., McKnight, C., Britton, E., & Nicol, C. (1993). Curriculum frameworks for mathematics and science. TIMSS Monograph No. 1. British Columbia, Canada.

Ross, J. A., McDougall, D., & Hogaboam-Gray, A. (2002). Research on reform in mathematics education, 1993-2000. Alberta Journal of Educational Research, 48(2), 122-138.

Schoenfeld, A. H. (1985). Mathematical problem solving. Orlando: Academic Press.

Stephens, M., Clarke, D., & Pavlou, M. (1994, 5-8 July). Policy to practice: High stakes assessment as a catalyst for classroom change. Paper presented at the 17th annual conference of the Mathematics Education Research Group of Australasia (MERGA): Challenges in mathematics education - constraints on construction, Lismore, Australia.


Swedish National Agency for Education. (2001a). Curriculum for the non-compulsory school system, lpf94. Stockholm: The National Agency for Education.

Swedish National Agency for Education. (2001b). Natural science programme: Programme goal, structure and syllabuses. Stockholm: Fritzes.


Differential Item Functioning for Items in the Swedish National Test in Mathematics, Course B
Gunilla Näsström, Umeå University

In this paper I want to discuss differences between groups of students regarding results on items in the Swedish national test on course B in mathematics, and to find at least part of an explanation for why these differences exist.

Since the autumn of 2000 the national test on course A has been compulsory for all students in upper secondary school (SFS 2001:56). The national test on course B is compulsory for the Social Science Programme (SP) and the Arts Programme (ES). For the Technology Programme (TE) the national test on course C is compulsory, and for the Natural Science Programme (NV) the national test on course D is compulsory. For the NV and TE programmes the national test on course B is strongly recommended (SFS 2001:56).

On course B in mathematics the students study four topics, namely Algebra, Geometry, Statistics and probability, and Functions (Skolverket, 2000).

DIF and Bias

The National Agency for Education has stated that the national course tests should be as valid and as fair as possible to all students. This statement can also be related to the ambition that education in upper secondary school must be equal for all students (SFS 2001:56). A valid test should not contain biased items. Bias is said to exist when a test or an item causes systematic errors in the measurement (Ramstedt, 1996). For example, if test scores indicate that boys perform better on a certain test than girls do, when in fact both sexes have equal ability, the test provides a biased measure. Biased items measure something other than what was intended.

Many aspects of a test and its use have to be considered when discussing test fairness, for example the way in which tests are used, the participants and the whole testing process (Willingham & Cole, 1997). People have different ideas of what constitutes fairness. There are also different perspectives on


fairness, for example in matters of law and the distribution of resources (Wolming, 2000). Willingham and Cole (1997) defined a fair test as a test that is comparably valid for all individuals and groups. Fair test design should, according to them, “provide examinees comparable opportunity, as far as possible, to demonstrate knowledge and skills they have acquired that are relevant to the purpose of the test” (p. 10). The use of the test is an important part of the comparability.

Differential item functioning (DIF) refers to a collection of statistical methods that give indications of items functioning differently for different groups of students. Hambleton et al (1991) defined DIF in the following way: “…an item shows DIF if individuals having the same ability, but from different groups, do not have the same probability of getting the item right” (p. 110). It should be added that further analyses are needed to determine whether an item that shows DIF is biased or not (Camilli & Shephard, 1994). It is then of interest to determine whether the differences depend on differences in the ability of the compared groups (not biased) or on the item measuring something other than intended (biased).

Differences in mathematics

There are many studies that focus on differences between men and women in tests (e.g. Stage, 1985; Gipps & Murphy, 1994; Kimball, 1994; Ramstedt, 1996; Wang & Lane, 1996; Willingham & Cole, 1997; Gallagher, De Lisi, Holst, McGillicuddy-De Lisi, Morely & Cahalan, 2000). One conclusion from several studies is that men have better spatial ability than women (e.g. Geary, 1994). Men use this spatial ability more often than women when solving problems, which can give them advantages on certain kinds of problems in geometry (Geary, 1994). Many studies also indicate that women have better verbal skills than men (Willingham & Cole, 1997), which can give them advantages on items where communication is important. Women also score higher on tests in mathematics that consist of items that match coursework (Willingham & Cole, 1997). Men tend to outperform women on geometry and on arithmetic and algebraic reasoning questions, whereas women tend to be better at intermediate algebra and at arithmetic and algebraic operations (Willingham & Cole, 1997).

Gallagher et al (2000) have studied gender differences in problem solving. They distinguished between conventional and unconventional problems.


Conventional problems were defined as routine textbook problems with clearly defined solution methods. Unconventional problems are seldom presented in textbooks and either require an unusual use of a familiar algorithm or can readily be solved by logical estimation or insight (Gallagher et al, 2000). The unconventional problems are more likely to be called problem solving according to the definition used in the national Test Bank for Mathematics.

Gallagher et al (2000) found that men outperformed women in solving all kinds of problems, but that the differences were greater for problems requiring spatial skills or multiple solution paths than for problems requiring verbal skills or containing classroom-based content.

The reported studies are partly contradictory. In Gallagher et al (2000) all kinds of problems favour men, but in other studies some kinds of items favour men and other kinds favour women.

According to Kimball (1994), the differences between the sexes are smaller than those between schools, social groups and ethnic groups. Women’s underestimation of their ability, teachers’ depreciation of women’s results, men’s better strategies for speeded tests, and women’s tendency not to attempt items that, to them, seem too difficult are also reasons why men perform better on problem-solving items in tests (Kimball, 1994). Women also tend to study less mathematics in upper secondary school, which can explain some of the differences between women and men (Willingham & Cole, 1997).

In Sweden there is an opportunity to study differences in test scores between women and men who have taken equivalent mathematics courses. Many countries have examination tests at the end of a programme of study, and tests for admittance to tertiary education after upper secondary school; there, students take the same tests but have taken mathematics courses of different lengths. In Sweden the national course tests are given at the end of each course, because the aims of the tests are to support teachers’ grading of their students and to make the grounds for assessment across the country as uniform as possible for each course in mathematics. Regardless of programme, the students that take a certain course test have studied the same course(s). Results from the national course tests are therefore comparable across programmes, which offers an opportunity to study differences between the sexes as well as between programmes.


Aim

The aim of this paper is to explore DIF, for girls and boys as well as for different national upper secondary programmes, in Swedish national tests in Mathematics, course B, from three perspectives.

This paper reports three studies. In the first study, results from the Mantel-Haenszel method, with total test score as the measure of ability, on the four national tests are examined. The second study compares the use of total test score with the use of test scores in each topic as alternative ways of measuring ability with the Mantel-Haenszel method. In the third study, the students’ solutions to one item (no. 14) in the national test on course B in mathematics in spring 2002 are examined.

Method

Data

The National Agency for Education collects data from a selection of upper secondary schools and municipal adult education institutions in Sweden for every national test. About one sixth of all schools and municipal adult education institutions have to send in their students’ results from the national course tests every year. For each selected group, the solutions of one student (no. 7 on the group list) have also been collected. The Department of Educational Measurement has access to data on the national course tests that are constructed at the department.

In this study, data from four national tests on course B in mathematics have been used, given in the autumn of 2000, the spring of 2001, the autumn of 2001 and the spring of 2002. The examined programmes are ES, SP, TE and NV.

Each item in the four studied national tests on course B in mathematics is categorized in the National Test Bank for Mathematics by several criteria (Hellström, Lindström & Wästle, in print). Some of these criteria are problem-solving, reasoning, and the topics Algebra, Geometry, Functions, and Statistics and probability. In this study, the categorisations in the National Test Bank are used. The items that are categorized as Statistics and probability are divided into two categories, Statistics and Probability, in the first study.


Statistical method

Mantel-Haenszel

Mantel-Haenszel is a χ²-method that has been developed for DIF studies, especially for items that are dichotomously scored (0 or 1). However, most items in the national course tests are polytomously scored. The polytomously scored items have been dichotomised by dividing the p-value by the maximum score, as Ramstedt did in his dissertation (1996).
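A generic sketch of such a dichotomisation follows; the exact cut point used in the paper follows Ramstedt (1996) and is only summarised here, so the function name, the data, and the 0.5 threshold are hypothetical, for illustration only.

```python
# Generic sketch of dichotomising a polytomously scored item; the function
# name, data and the 0.5 cut are hypothetical, used only for illustration.
def dichotomise(score, max_score, cut):
    """Recode a 0..max_score item score as 0/1: 1 when the proportion of
    the maximum score reached is at least `cut`."""
    return 1 if score / max_score >= cut else 0

scores = [0, 1, 2, 3]   # a hypothetical item worth 3 marks
print([dichotomise(s, 3, 0.5) for s in scores])   # [0, 0, 1, 1]
```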

In the Mantel-Haenszel method, two groups are compared at every total score. In the first study, the ability of a student is measured as the student’s total test score. In the second study, ability is measured partly as the total score on the items in each topic of mathematics and partly as total test score. The results of the calculations indicate whether the difference is significant as well as which group is favoured by a particular item.

The level of statistical significance is set at 5 % in this paper.
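The per-score-level comparison can be pictured with a minimal sketch (not the authors’ code; the data are invented): students are stratified by total score, and within each stratum a 2×2 table of group versus item outcome is formed.

```python
# Minimal sketch (not the authors' code) of the Mantel-Haenszel DIF
# statistic: students are stratified by total test score, and within each
# stratum a 2x2 table (group x item right/wrong) is formed.
def mantel_haenszel(strata):
    """strata: list of ((A, B), (C, D)) per total-score level, where
    A/B = reference group right/wrong and C/D = focal group right/wrong."""
    sum_a = sum_e = sum_v = 0.0   # observed A, expected A, variance of A
    num = den = 0.0               # numerator/denominator of the odds ratio
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        if n < 2:
            continue              # a stratum this small carries no information
        n_ref, n_foc = a + b, c + d
        m1, m0 = a + c, b + d     # column margins: right / wrong
        sum_a += a
        sum_e += n_ref * m1 / n   # expected A under the no-DIF hypothesis
        sum_v += n_ref * n_foc * m1 * m0 / (n * n * (n - 1))
        num += a * d / n
        den += b * c / n
    chi2 = (abs(sum_a - sum_e) - 0.5) ** 2 / sum_v  # continuity-corrected
    alpha = num / den             # MH common odds ratio
    return chi2, alpha

# Invented data: at every score level the reference group succeeds more
# often than the focal group of the same ability, so the item shows DIF.
strata = [((28, 12), (18, 22)),
          ((30, 10), (22, 18)),
          ((35, 5), (28, 12))]
chi2, alpha = mantel_haenszel(strata)
print(chi2 > 3.84)   # True: significant at the 5 % level (1 df)
print(alpha > 1)     # True: the item favours the reference group
```

The continuity-corrected statistic is compared with 3.84, the 5 % critical value of χ² with one degree of freedom, so the toy item above would be flagged as showing DIF favouring the reference group.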

When the number of individuals in one group of students was equal to or less than 20, the Mantel-Haenszel method was not used. This happened for TE in the autumn of 2000 and 2001, and for the comparisons between women and men from NV in the autumn of 2001.

Only significant differences are reported. The Mantel-Haenszel statistics depend on the size of the groups (Ramstedt, 1996): if the groups are very big, all differences are significant, while for small groups there are no significant differences despite seemingly very large differences. In this paper no non-significant differences, regardless of how large they are, are reported. The groups are rather large, so the significant differences are classified, by the rule of thumb from ETS, into three groups (Ramstedt, 1996): for group A the DIF is negligible, for group B moderate, and for group C serious. In this paper the items that are classified as B or C are collected into one group called serious DIF.
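The ETS rule of thumb mentioned above can be sketched as follows, in a simplified form: the Mantel-Haenszel common odds ratio is rescaled to the delta metric, and the absolute delta decides the class. The full ETS rule also involves significance tests on the delta value, which are omitted in this sketch.

```python
from math import log

# Simplified sketch of the ETS A/B/C rule of thumb: the MH common odds
# ratio is rescaled to the delta metric, and |delta| decides the class.
# (The full ETS rule also requires significance tests, omitted here.)
def ets_class(alpha_mh):
    delta = -2.35 * log(alpha_mh)   # MH delta difference
    if abs(delta) < 1.0:
        return "A"                  # negligible DIF
    if abs(delta) < 1.5:
        return "B"                  # moderate DIF ("serious" in this paper)
    return "C"                      # large DIF ("serious" in this paper)

print(ets_class(1.0))   # "A": odds ratio 1 means no DIF at all
print(ets_class(2.0))   # "C": |-2.35 ln 2| is about 1.63
```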

Item 14 in the national test on course B spring 2002

The students’ solutions to item 14 in the national test on course B in spring 2002 have been studied and categorized by type of solution.


Item 14 from the national test on course B spring 2002:

Instructions for scoring item 14:

14. Max 1/2
Acceptable attempt, found one of the conditions y = v or u + z + v = 180°   +1 g
Found one more condition   +1 vg
With correctly completed proof   +1 vg

This item is categorized as Reasoning, Geometry and Algebra.

The solutions are categorized by the type of solution the students present. The marking instructions gave the teachers some latitude in interpreting how much a student has to present to get full marks. To get an idea of what teachers in general demand on this item for full marks, I consulted six teachers who were at a meeting at the department to discuss future national tests in mathematics. These six teachers, three men and three women, come from different towns in Sweden and teach different groups of students.

Page 147: Proceedings of the Third International SweMaS Conference ... · Proceedings of the Third International SweMaS Conference Umeå, ... to the education and training system to ensure

137

Results

Study one

Table 1. Number of items in four national tests on course B in mathematics that show DIF, and the total number of items in these four national tests. The items are divided by topics of mathematics and types of items. An item can belong to more than one category.

                 Men-Women   NV-SP      NV-TE¹     ES-SP     NV sex²    SP sex     Total no.
                 M     W     NV    SP   NV    TE   ES   SP   M     W    M     W    of items
Algebra          11    19    10    8    7     2    2    5    2     9    4     11   53
 -serious        3     2     6     1    1     2    2    2    2     6    2     5
Geometry         5     2     5     0    2     0    2    0    1     4    2     2    15
 -serious        1     0     3     0    0     0    0    0    1     3    0     1
Statistics       1     1     0     1    1     0    0    2    0     0    1     0    6
 -serious        0     0     0     0    1     0    0    2    0     0    0     0
Probability      12    0     0     7    1     2    0    2    3     1    10    0    16
 -serious        6     0     0     3    0     1    0    0    2     1    4     0
Functions        4     3     5     2    2     3    0    1    3     2    0     1    18
 -serious        1     2     5     0    2     0    0    1    0     1    0     1
Reasoning        11    6     7     6    5     2    1    5    5     4    6     2    31
 -serious        1     1     4     2    2     1    0    3    3     3    0     1
Problem-solving  8     2     6     1    0     1    0    3    4     0    3     0    17
 -serious        3     1     5     0    0     0    0    2    3     0    0     0
No. of students  6823  7564  3485  5585 3485  1270 840  5585 1878  1495 2051  3420

¹ No Mantel-Haenszel for groups of up to 20 persons. No results from two tests; the total numbers of items for these comparisons are 30, 7, 9, 2, 9, 18 and 6, respectively.
² No Mantel-Haenszel for groups of up to 20 persons. No results from one test; the total numbers of items for these comparisons are 41, 12, 3, 12, 15, 24 and 11, respectively.


Women - men

In Table 1 it is possible to see some patterns. Items in Probability seem to favour men: about 75 % of all items in Probability favour men, while none favour women, and for six of the items in Probability the DIF is serious. The same pattern is found between men and women from SP, where ten of the 16 items in Probability favour men and four of these show serious DIF. However, for men and women from NV this pattern is not found.

NV - SP

Items that show DIF in Geometry favour students from NV, as do items categorized as Problem-solving. Items in Probability favour students from SP.

Study two

Table 2. Number of items in the national test on course B in mathematics in the spring of 2002 that show DIF, when ability is measured as total test score and as test score in each topic. The second row presents how many of the items that show DIF with one measure of ability also show DIF with the other.

           Total test score                         Test score in each topic
           Women-men  NV-SP  NV-TE  ES-SP  Total    Women-men  NV-SP  NV-TE  ES-SP  Total
No.        15         14     11     5      45       14         11     7      3      35
No. agree  13         10     7      2      32       13         10     7      2      32

The agreement between the two measures of ability is rather high as to which items show DIF: 32 items show DIF with both measures of ability. For example, for men and women 13 items show DIF for both measures of ability, favouring the same sex; no item shows DIF favouring the opposite group. The two items in the test that are categorized as both Algebra and Geometry show DIF for total score and for test score in Algebra, but not for test score in Geometry.
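The agreement count in the second row of Table 2 can be pictured as a simple set intersection; the item numbers below are invented for illustration only.

```python
# Hypothetical sketch of the agreement count: each analysis yields a set
# of flagged items, and the second table row is the intersection size.
# The item numbers are invented for illustration only.
flagged_total = {1, 3, 5, 7, 9}   # DIF with ability = total test score
flagged_topic = {1, 3, 5, 8}      # DIF with ability = topic score
agree = flagged_total & flagged_topic
print(len(agree))                 # 3 items flagged by both measures
```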


Study three - Item no. 14

Item no. 14 in the national test on course B in mathematics in the spring of 2002 shows DIF. In a comparison between men and women, the item favours men, regardless of whether ability is measured as total test score or as test score on algebra items. The item favours students from NV, in comparison with students from SP or TE, regardless of the measure of ability. The DIF is serious when students from NV are compared with students from SP.

The total number of legible student solutions from the national test in the spring of 2002 was 222. One student’s solution has been excluded since the teacher had not written the marks on the paper.

Five of the six teachers at the meeting wanted the students to find both conditions, come to the right conclusion and explain at least one of the conditions to get full marks. One teacher did not think it necessary to explain the conditions to get full marks.

According to the six teachers at the meeting, 38 % of the students presented a proof worth full marks (using mathematical expressions and/or words). Of the students from NV, 63 % presented a proof worth full marks, compared with only 20 % of the students from SP. A minimal proof, in which none of the conditions was explained, was presented by 11 % of the students. 20 % of the students’ solutions contained only one or both of the conditions, but no proof or conclusion. The rest of the students had no solution or a solution that had nothing to do with the content of the item.

Of the students who had presented a proof worth full marks, 74 % got full marks and 24 % got two marks. Of the students from NV, 80 % got full marks, compared with only 56 % of the students from SP; most of the other students got two marks for their proof worth full marks. 71 % of the men and 78 % of the women got full marks for a proof worth full marks.

Of the students who had presented a minimal proof, which according to the teachers I had consulted was worth two marks, 48 % got full marks, 40 % got two marks and 12 % got one mark. 74 % of the men but only 17 % of the women got full marks for a minimal proof; 15 % of the men and 67 % of the women got two marks.

Discussion

Students should learn the same things on course B in mathematics regardless of programme, but there are obvious differences between the programmes. How large a difference is acceptable when education for students on all programmes should be equivalent?

For example, items in Probability favour students from SP compared with students from NV. Is the test design fair according to Willingham and Cole’s (1997) definition of fair test design? If all students have sufficient knowledge and skills in a particular topic, the differences on items in Probability indicate that the test design is not fair. It is possible, if not probable, that the balance in teaching between the four topics varies between programmes. If so, the observed differences on items are real differences and not bias. In that case the test can be fair, and the items give the students the same opportunity to demonstrate their knowledge and skills, but the students do not have the same knowledge and skills. The question is then rather whether the teaching is fair.

The results of the national tests are used as one indicator of the quality of education. For this reason it is important that the national tests are fair to all groups of students. If they are not, the results can give an incorrect picture of the quality of education for different groups, and can lead to resources for education being distributed unfairly.

Women and men seem to learn different things during the same course. Items in Probability favour men, in the total group and within SP. Is it fair that one kind of item favours men to such a large extent in a national test? Is this a problem for the construction of the national tests on course B? Probability is part of the syllabus, but so is statistics, where the items are fewer and do not show the same pattern. Should the number of items in probability decrease and the number of items in statistics increase? It is difficult to find good items in statistics [considering the A course].

The two measures of ability seem to correlate, so the point of measuring ability as the test score in each topic can be questioned. The numbers of marks in a couple of the topics are rather small, which makes the measurement uncertain. It seems better to use the total score as the measure of ability.

The students from NV got more credit for their full-marks solutions of item 14 than students from SP. Many of the solutions are written in words only, and this can explain the differences when the teachers assess their students’ solutions: students from NV seem to be better trained than students from SP in how to use mathematical expressions and language. Men also got more credit for their minimal proofs on item 14 than women did. This can be part of the explanation of why this item shows DIF favouring men. How can the marking instructions be formulated to avoid this kind of difference in marking between groups of students, while still giving teachers the opportunity to assess their students with consideration to how they have been taught mathematics?

References
Camilli, G., & Shephard, L. A. (1994). Methods for identifying biased test items. London: Sage Publications Ltd.

Gallagher, A. M., De Lisi, R., Holst, P. C., McGillicuddy-De Lisi, A. V., Morely, M. & Cahalan, C. (2000). Gender differences in advanced mathematical problem solving. Journal of Experimental Child Psychology, 75, 165-190.

Geary, D. C. (1994). Children’s mathematical development. Washington, DC: American Psychological Association.

Gipps, C. & Murphy, P. (1994). A fair test? Assessment, achievement and equity. Buckingham, G.B.: Open University Press.

Hambleton, R. K., Swaminathan, H. & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: SAGE Publications, Inc.

Hellström, T., Lindström, J-O. & Wästle, G. (in print). The classification of items for a Test Bank for Mathematics. EM No 37.

Kimball, M.M. (1994). Bara en myt att flickor är sämre i matematik. (It is only a myth that girls are poorer in mathematics.). Kvinnovetenskaplig tidskrift, 15(4), 39-53.

Ramstedt, K. (1996). Elektriska flickor och mekaniska pojkar. Om gruppskillnader på prov – en metodutveckling och en studie av skillnader mellan flickor och pojkar på centrala prov i fysik. (Electrical Girls and Mechanical Boys. On Group Differences in Tests – a Method Development and a Study of Differences between Girls and Boys in National Tests in Physics) Doctoral dissertation, University of Umeå.

SFS 2001:56. The Upper Secondary School Ordinance. Stockholm: Utbildningsdepartementet.

Skolverket. (2000). Naturvetenskapsprogrammet. Programmål, kursplaner, betygskriterier och kommentarer. (The Natural Science Programme. Programme goal, syllabi, grading criteria and comments.). Borås: Skolverket.

Stage, C. (1985). Gruppskillnader i provresultat. Uppgiftsinnehållets betydelse för resultatskillnader mellan män och kvinnor på prov i ordkunskap och allmänorientering. (Group differences in test results. The significance of test item contents for sex differences in results on test on vocabulary and general knowledge.) Doctoral dissertation, University of Umeå.


Wang, N., & Lane, S. (1996). Detection of gender-related differential item functioning in a mathematics performance assessment. Applied Measurement in Education, 9(2), 175-199.

Willingham, W. W., & Cole, N. S. (1997). Gender and fair assessment. Mahwah, NJ: Lawrence Erlbaum Associates.

Wolming, S. (2000). Validering av urval. (Validation of selection). Doctoral dissertation, University of Umeå.

Questions I want to discuss:

1. Do you know of any studies concerning how teachers mark different groups of students on items that demand that a solution be presented? Are there any studies that compare the marking of solutions written in words only with the marking of solutions that use mathematical expressions and language?

The Achilles’ heel of DIF studies is the measure of ability. In this paper two different measures of ability are used, but both are internal, i.e. based on test scores, and internal measures of ability are problematic. A dream is to find an external measure of ability, but that is also problematic. Grades are often used, but they are not unproblematic either: women often have better grades than men, and women earn higher grades on lower test results compared to men. One idea is to let teachers predict their students’ results on the next national test on course B in mathematics, using the same grading scale as the one used for the national test, extended with plus and minus. Teachers in Sweden are used to, and probably good at, grading their students; in Sweden we have no examination tests for external grading of students. Since the teachers have had a lot of practice in grading their students, they should be able to give good predictions of their future test results.

2. What is your opinion on measuring ability as total test score and as test score in each topic?

3. What is your opinion on using teachers’ prediction of their students’ result as a measure of ability?

Page 153: Proceedings of the Third International SweMaS Conference ... · Proceedings of the Third International SweMaS Conference Umeå, ... to the education and training system to ensure

EDUCATIONAL MEASUREMENT Reports already published in the series EM No 1. SELECTION TO HIGHER EDUCATION IN SWEDEN. Ingemar

Wedman EM No 2. PREDICTION OF ACADEMIC SUCCESS IN A PERSPECTIVE OF

CRITERION-RELATED AND CONSTRUCT VALIDITY. Widar Henriksson, Ingemar Wedman

EM No 3. ITEM BIAS WITH RESPECT TO GENDER INTERPRETED IN THE

LIGHT OF PROBLEM-SOLVING STRATEGIES. Anita Wester EM No 4. AVERAGE SCHOOL MARKS AND RESULTS ON THE SWESAT.

Christina Stage EM No 5. THE PROBLEM OF REPEATED TEST TAKING AND THE SweSAT.

Widar Henriksson EM No 6. COACHING FOR COMPLEX ITEM FORMATS IN THE SweSAT.

Widar Henriksson EM No 7. GENDER DIFFERENCES ON THE SweSAT. A Review of Studies since

1975. Christina Stage EM No 8. EFFECTS OF REPEATED TEST TAKING ON THE SWEDISH SCHO-

LASTIC APTITUDE TEST (SweSAT). Widar Henriksson, Ingemar Wedman

1994 EM No 9. NOTES FROM THE FIRST INTERNATIONAL SweSAT CONFEREN-

CE. May 23 - 25, 1993. Ingemar Wedman, Christina Stage EM No 10. NOTES FROM THE SECOND INTERNATIONAL SweSAT

CONFERENCE. New Orleans, April 2, 1994. Widar Henriksson, Sten Henrysson, Christina Stage, Ingemar Wedman and Anita Wester

EM No 11. USE OF ASSESSMENT OUTCOMES IN SELECTING CANDIDATES

FOR SECONDARY AND TERTIARY EDUCATION: A COMPARISON. Christina Stage

EM No 12. GENDER DIFFERENCES IN TESTING. DIF analyses using the Mantel-

Haenszel technique on three subtests in the Swedish SAT. Anita Wester 1995 EM No 13. REPEATED TEST TAKING AND THE SweSAT. Widar Henriksson

Page 154: Proceedings of the Third International SweMaS Conference ... · Proceedings of the Third International SweMaS Conference Umeå, ... to the education and training system to ensure

EM No 14. AMBITIONS AND ATTITUDES TOWARD STUDIES AND STUDY RESULTS. Interviews with students of the Business Administration study program in Umeå, Sweden. Anita Wester
EM No 15. EXPERIENCES WITH THE SWEDISH SCHOLASTIC APTITUDE TEST. Christina Stage
EM No 16. NOTES FROM THE THIRD INTERNATIONAL SweSAT CONFERENCE. Umeå, May 27-30, 1995. Christina Stage, Widar Henriksson
EM No 17. THE COMPLEXITY OF DATA SUFFICIENCY ITEMS. Widar Henriksson
EM No 18. STUDY SUCCESS IN HIGHER EDUCATION. A comparison of students admitted on the basis of GPA and SweSAT scores with and without credits for work experience. Widar Henriksson, Simon Wolming

1996
EM No 19. AN ATTEMPT TO FIT IRT MODELS TO THE DS SUBTEST IN THE SweSAT. Christina Stage
EM No 20. NOTES FROM THE FOURTH INTERNATIONAL SweSAT CONFERENCE. New York, April 7, 1996. Christina Stage

1997
EM No 21. THE APPLICABILITY OF ITEM RESPONSE MODELS TO THE SweSAT. A study of the DTM subtest. Christina Stage
EM No 22. ITEM FORMAT AND GENDER DIFFERENCES IN MATHEMATICS AND SCIENCE. A study on item format and gender differences in performance based on TIMSS data. Anita Wester, Widar Henriksson
EM No 23. DO MALES AND FEMALES WITH IDENTICAL TEST SCORES SOLVE TEST ITEMS IN THE SAME WAY? Christina Stage
EM No 24. THE APPLICABILITY OF ITEM RESPONSE MODELS TO THE SweSAT. A Study of the ERC Subtest. Christina Stage
EM No 25. THE APPLICABILITY OF ITEM RESPONSE MODELS TO THE SweSAT. A Study of the READ Subtest. Christina Stage
EM No 26. THE APPLICABILITY OF ITEM RESPONSE MODELS TO THE SweSAT. A Study of the WORD Subtest. Christina Stage
EM No 27. DIFFERENTIAL ITEM FUNCTIONING (DIF) IN RELATION TO ITEM CONTENT. A study of three subtests in the SweSAT with focus on gender. Anita Wester

EM No 28. NOTES FROM THE FIFTH INTERNATIONAL SweSAT CONFERENCE. Umeå, May 31 – June 2, 1997. Christina Stage

1998
EM No 29. A COMPARISON BETWEEN ITEM ANALYSIS BASED ON ITEM RESPONSE THEORY AND ON CLASSICAL TEST THEORY. A Study of the SweSAT Subtest WORD. Christina Stage
EM No 30. A COMPARISON BETWEEN ITEM ANALYSIS BASED ON ITEM RESPONSE THEORY AND ON CLASSICAL TEST THEORY. A Study of the SweSAT Subtest ERC. Christina Stage
EM No 31. NOTES FROM THE SIXTH INTERNATIONAL SweSAT CONFERENCE. San Diego, April 12, 1998. Christina Stage

1999
EM No 32. NONEQUIVALENT GROUPS IRT OBSERVED SCORE EQUATING. Its Applicability and Appropriateness for the Swedish Scholastic Aptitude Test. Wilco H.M. Emons
EM No 33. A COMPARISON BETWEEN ITEM ANALYSIS BASED ON ITEM RESPONSE THEORY AND ON CLASSICAL TEST THEORY. A Study of the SweSAT Subtest READ. Christina Stage
EM No 34. PREDICTING GENDER DIFFERENCES IN WORD ITEMS. A Comparison of Item Response Theory and Classical Test Theory. Christina Stage
EM No 35. NOTES FROM THE SEVENTH INTERNATIONAL SweSAT CONFERENCE. Umeå, June 3–5, 1999. Christina Stage

2000
EM No 36. TRENDS IN ASSESSMENT. Notes from the First International SweMaS Symposium, Umeå, May 17, 2000. Jan-Olof Lindström (Ed)
EM No 37. NOTES FROM THE EIGHTH INTERNATIONAL SweSAT CONFERENCE. New Orleans, April 7, 2000. Christina Stage

2001
EM No 38. NOTES FROM THE SECOND INTERNATIONAL SweMaS CONFERENCE. Umeå, May 15-16, 2001. Jan-Olof Lindström (Ed)
EM No 39. PERFORMANCE AND AUTHENTIC ASSESSMENT, REALISTIC AND REAL LIFE TASKS: A Conceptual Analysis of the Literature. Torulf Palm

EM No 40. NOTES FROM THE NINTH INTERNATIONAL SweSAT CONFERENCE. Umeå, June 4–6, 2001. Christina Stage

2002
EM No 41. THE EFFECTS OF REPEATED TEST TAKING IN RELATION TO THE TEST TAKER AND THE RULES FOR SELECTION TO HIGHER EDUCATION IN SWEDEN. Widar Henriksson, Birgitta Törnkvist

2003
EM No 42. CLASSICAL TEST THEORY OR ITEM RESPONSE THEORY: The Swedish Experience. Christina Stage
EM No 43. THE SWEDISH NATIONAL COURSE TESTS IN MATHEMATICS. Jan-Olof Lindström
EM No 44. CURRICULUM, DRIVER EDUCATION AND DRIVER TESTING. A comparative study of the driver education systems in some European countries. Henrik Jonsson, Anna Sundström, Widar Henriksson

2004
EM No 45. THE SWEDISH DRIVING-LICENSE TEST. A Summary of Studies from the Department of Educational Measurement, Umeå University. Widar Henriksson, Anna Sundström, Marie Wiberg
EM No 46. SweSAT REPEAT. Birgitta Törnkvist, Widar Henriksson
EM No 47. REPEATED TEST TAKING. Differences between social groups. Birgitta Törnkvist, Widar Henriksson