Educational Research
Publication details, including instructions for authors and subscription information:
http://www.tandfonline.com/loi/rere20

A brief history of a testing time: national curriculum assessment in England 1989–2008
Chris Whetton, National Foundation for Educational Research, UK
Published online: 20 May 2009.

To cite this article: Chris Whetton (2009) A brief history of a testing time: national curriculum assessment in England 1989–2008, Educational Research, 51:2, 137-159, DOI: 10.1080/00131880902891222

To link to this article: http://dx.doi.org/10.1080/00131880902891222
A brief history of a testing time: national curriculum assessment in
England 1989–2008
Chris Whetton*
National Foundation for Educational Research, UK
(Received 1 August 2008; final version received 5 December 2008)
Background: National curriculum assessment (NCA) in England has been in place for nearly 20 years. It has its origins in a political desire to regulate education, holding schools accountable. However, its form and nature also reflect educational and curriculum concerns and technical assessment issues.
Purpose: The aim of the article is to provide a narrative account of the development and changes in NCA in England from its initiation to 2008 and to explain the reasons for these.
Sources of evidence: The sources quoted are in the public domain, but in addition to academic articles, include political biographies and published official papers.
Main argument and conclusions: NCA in England has evolved over 20 years, from an attempt at a criterion-referenced system based on tasks marked by the children's own teachers through to an externally marked examination system. This change reflects the political purposes of the system for accountability, and the pressure associated with this has led to growing criticism of the effects on children and their education. Nonetheless, the results provided are widely used by the public and government, and the reasons for the survival of the system lie in both its utility and the difficulty of identifying a new system which is necessarily an improvement for all the stakeholders involved.
Keywords: assessment; testing; National Curriculum; SATs; assessment policy
Introduction
The history of national curriculum assessment (NCA) in England is a story with
several layers and several stages. The first layer is the detailed assessment system
itself – the tests, tasks and classroom assessments, together with their adminis-
trative, standard-setting and reporting arrangements. These have technical aspects
including their reliability, validity and manageability. The second layer is the wider
educational context of the curriculum (these are, after all, curriculum assessments),
schools’ practices and even the relationship between schools and society,
particularly parents. The third layer is the wider political and social milieu, which
must be seen in the context of the initial motive for the assessment system and
the reasons for its continuation. NCA was initiated by an Act of Parliament in
1988 and its structures, uses and instruments have been the subject of constant
political debate and decision. At times, this has involved personal interventions
*Email: [email protected]
Educational Research
Vol. 51, No. 2, June 2009, 137–159
ISSN 0013-1881 print/ISSN 1469-5847 online
© 2009 NFER
DOI: 10.1080/00131880902891222
http://www.informaworld.com
from the highest levels, with British prime ministers or their offices involved
and disagreements between those offices and the government’s education
department. This political layer must also be seen as incorporating the wider
social changes in Britain, and especially the initial Thatcherite1 distrust of
established professional interests and the later Blairite2 focus on the management
mechanisms of accountability and target setting as a means of effecting change in
public services.
A persistent aspect of the history is that the layers occasionally came into
collision. When this did happen, the language, values and beliefs of the three layers
differed and hence communication between them and understanding of each other
was limited.
The stages of the history have been: the initial establishment of a system; a
turbulent revolt and a simplification of that system; then a period of relative stability
but with a gradual increase in the uses and purposes of the assessment system and a
growth in the influence of the accountability purpose; and most recently, a search for
an alternative system.
There is a final introductory element to state. This is a ‘brief history’, so it is
inevitably selective. There is much more that could be written. As the author of this
‘brief history’, I have attempted to adopt the role of narrator. I should acknowledge
that I have been involved in various ways in each of the stages described, generally
as test developer or researcher. Consequently, I cannot but have my own views
and beliefs, which inevitably intrude into such a narrative. The personal involvement
means that there are some statements for which I have no written source quoted.
The sources quoted are in the public domain, but in addition to academic articles,
include political biographies and published official papers. I have not utilised media
sources (which could tell another story) except as referred to in the referenced
documents.
Although I have attempted to be a neutral narrator, a degree of bias is
inevitable and others will differ in their interpretation of events. This is the nature of
a history and if this piece provokes discussion and other views, it will have
succeeded.
The pre-cursors
The starting point for the action of NCA was the return of Margaret Thatcher for
her third term of office in 1987. The Conservative manifesto had included a
promise of educational reforms and the new government set about introducing
these. This was not, however, a sudden whim. Dissatisfaction within governments
with education had been growing since the 1970s, as set out in Michael Barber’s
(1996) The National Curriculum: a study in policy, and other political accounts (e.g.
Graham and Tytler 1993). These sources point to several important factors. The
first was a growing discomfort with the adherence to child-centred approaches in
primary schools following the Plowden report of 1967 (DfES 1967). This
culminated in the scandal of the William Tyndale School, which received great
media publicity in 1975. In that school in London, the principal and teachers of
the school took child-centeredness to extremes, defying any attempts at control by
parents or the local authority. This became a symbol of the seemingly total absence
of accountability in schools at the time. A second factor was the growing
realisation of the changing educational needs of Britain, which followed from the
economic upheavals from the 1970s onwards, towards the continuing globalisation
of the present. The economic crisis of the mid-1970s was one of the prompts for
the first public governmental statement of concerns, by the then (Labour) prime
minister, James Callaghan, in 1976 at Ruskin College.3 As he often did, he
attempted to steer a middle way, emphasising the high standards of the past but
referring to the need not just to maintain standards but to improve them. He
recognised the professionalism and vocation of teachers but lamented complaints
from industry that recruits did not have the basic tools required. He alluded to the
resources that had been injected into education and the £6 billion spent on it, and
almost invited the teaching profession to put its house in order: 'For if the public is
not convinced then the profession will be laying up trouble for itself in the future.’
This warning was not heeded.
Callaghan soon departed from power, and the country moved into 18 years of
Conservative rule, initially under Margaret Thatcher. Her first two governments
were concerned principally with areas other than education, but ministers, for
example the Chancellor of the Exchequer4 (see memoirs of Nigel Lawson, 1992),
were heavily influenced by the historical analysis of Correlli Barnett (1986), whose
The audit of war described the decline in British power during the twentieth century,
and attributed this in part to the failure of education.
This background remains of importance in current debates about NCA.
Accountability in education is now a genie that is out of the bottle and will have
to continue in some form. The belief in the link between educational standards and
economic success as a country has grown stronger. Twenty or more years after the
concerns outlined here a government minister from a different party was still
alluding to the 40 years of static standards, contrasting it with rises in the 1990s
(Miliband 2003).
In their third term in 1987, the Conservatives turned their attention to education.
Margaret Thatcher in her biography says:
The starting point for the education reforms . . . was a deep dissatisfaction (which I fully shared) with Britain's standard of education. There had been improvement in the pupil–teacher ratio and real increases in spending per child. But increases in public spending had not by and large led to higher standards. (Thatcher 1993, 590)
And so in 1988, the Education Reform Act established by law that there should be a
national curriculum and provided that, as an integral element of this, there should be
‘arrangements for assessing pupils at or near the end of each key stage for the
purpose of ascertaining what they have achieved in relation to the attainment targets
for that stage’ (Education Reform Act, 1, 2: (2)).
The term ‘key stage’ was a new one to the education system in England. There
were four key stages, which were: five to seven years old (key stage 1); eight to 11
years old (key stage 2); 12 to 14 years old (key stage 3); and 15 to 16 years old (key stage 4). Also introduced was a common labelling to describe the years of education,
from Year 1 (broadly six-year-olds) to Year 11 (broadly 16-year-olds in the last year
of compulsory education). Prior to this, there had been many organisational patterns
in different parts of England. One effect of the Act and the testing system was to
bring greater coherence to the system as a whole and to provide a clear demarcation
between the end of primary schooling (key stage 2) and secondary education in key
stages 3 and 4.
The early developments
The suddenness of the decision and the politically driven need for speed of
implementation meant that there was no existing model for putting these
arrangements into place.
The Education Reform Act specified that there should be an assessment system
with testing at or about the end of each of the four key stages. The form of this
testing, the uses of the results and all other details were left open to be decided by the
Secretary of State for Education. Several accounts of this first implementation have
been published. These give the story from the viewpoints of various participants in
the events. Among them are: the politician’s story (Baker 1993); the chief executive’s
story (Graham and Tytler 1993); the system architect’s story (Black 1997), the
Council member’s story (Daugherty 1995); the union official’s story (Barber 1996)
and finally the test developer’s story (Sainsbury 1996). In addition, there were
evaluation projects examining different aspects of the process (e.g. Shorrocks et al.
1992; Ruddock and Tomlins 1993; Gipps et al. 1995). The fullest account, which
merges the three levels of political detail, educational considerations and technical
understanding, is Shorrocks-Taylor (1999).
Overall responsibility for the tests has rested with a succession of statutory
bodies,5 initially the Schools Examination and Assessment Council (SEAC) created
by the 1988 Education Reform Act. In 1993, as part of the response to the crisis of
that year, this merged with the National Curriculum Council to become the School
Curriculum and Assessment Authority (SCAA), which itself merged with the
vocational assessment agency (the National Council for Vocational Qualifications)
to become the Qualifications and Curriculum Authority (QCA) in 1997. During
2008, it was announced that QCA itself would be broken up, with its regulatory
function passing to a new agency.
Each of these statutory bodies had its own particular approach and each was
influenced by the politics of its time as well as the views of its senior staff. This in part
accounts for the lack of coherence in the system as a whole, which will become
evident later in this account.
The blueprint
An initial blueprint for the assessment system was provided by a group known by the
acronym TGAT. This stood for Task Group on Assessment and Testing and was
composed largely of educationalists. The group was chaired by Paul Black, an
eminent science educator, and consisted of two directors of research or assessment
organisations, two head teachers, a chief education officer, an ex-chief HMI for
Primary Education, an economic researcher, an emeritus professor of electronics and
a personnel director of an engineering firm. The group immediately set about
devising a system for the curriculum and its assessment. Their task was not an easy
one. The terms of reference of the group were to advise on:
. . . the practical considerations which should govern all assessment including testing of attainment at age approximately 7, 11, 14 and 16, with a national curriculum; including the marking scale or scales and kinds of assessment including testing to be used, the need to differentiate so that the assessment can promote learning across a range of abilities, the relative role of informative and of diagnostic assessment, the uses to which the results of assessment should be put, the moderation requirements needed to secure
credibility for assessments, and the publication and other services needed to support the system – with a view to securing assessment and testing arrangements which are simple to administer, understandable by all in and outside the education service, cost-effective, and supportive of learning in schools. (DES and WO 1988)
TGAT proposed an assessment system to meet five distinct purposes. These were
that it should be:
. formative – providing information on where a pupil is, enabling teachers to
plan the next stages;
. summative – providing overall information on the achievement of pupils;
. evaluative – providing aggregated information on classes and schools to assess
curriculum issues, as well as the functioning of teachers and schools;
. informative – providing information to parents about their own children and
general information about the school;
. for professional development – giving teachers greater sophistication in
assessment, recording and monitoring so that they can evaluate their own
work.
To meet these objectives, TGAT proposed an innovative assessment system (DES
and WO 1988). Attainment was to be measured on a continuous scale of 10 levels,
covering the entire five to 16 age range. This 10-level scale was to be criterion
referenced with each level defined by a set of criteria. Pupils were to attain a level by
demonstrating the performance set out in these criteria.
The full TGAT scheme had many elements and ramifications. It was initially
hailed as a success because it appeared to meet both the government’s desire for a
system providing accountability and the teachers’ desire for a system based on
professional judgement and diagnostic assessment.
Immediately though, there were political problems. Margaret Thatcher found the
proposals:
a weighty, jargon filled document in my overnight box with a deadline for publication the following day. The fact that it was then welcomed by the Labour party, The National Union of Teachers and the Times Educational Supplement was enough to confirm for me that its approach was suspect. It proposed an elaborate and complex system of assessment – teacher dominated and uncosted. (Thatcher 1993, 594)
In a sense this was correct, as the seeds of trouble were sown by TGAT or the
manner in which it was implemented, though Paul Black continued to defend the
principles for many years (e.g. Black 1997).
The TGAT model as implemented in terms of curriculum structure was such that
for each level of each attainment target there were between one and 10 statements of
attainment that defined it and that required assessment in some way. There were also
many attainment targets for each subject. As a consequence, across a subject, there
were a large number of statements. The attainment targets in each subject and the
statements of attainment defining each level were determined by separate working
groups of experts in each subject. Thus the criteria did not represent empirical
findings about pupils’ attainments, but rather the judgements of the working groups
comprising experts and interested parties. They were aspirational targets in the sense
of what ought to be achieved rather than being defined in any research-based
empirical sense. They also differed between subjects, as each reflected its own
structures and assumptions. In addition, adding to the incoherence, there was a sort
of competition for the curriculum, as subject groups expanded their own subject to
ensure its importance as part of the whole. Science, for example, ‘stole’ earth science
from the geographers.
Key stage 1 and standard assessment tasks (SATs)
The assessment of children at the end of key stage 1, i.e. for seven-year-olds, was the
first part of the assessment regime to be put into place. This was largely an accident,
since this was the key stage that was shortest and therefore required end of key stage
assessments first. In many ways this was unfortunate, since the assessment of
children at this age was the most unknown area and also the most problematic.
Established assessments of children at this age used individual administration or
were relatively informal, but for the first time, these were having to be applied to a
compulsory universal testing regime. The tasks initially involved three subjects
(mathematics, English and science), covering a sample of nine attainment targets
leading to over 100 statements of attainment, which had to be assessed for each
child. For teachers with large classes, this multiplied up to thousands of judgements
to be made and recorded. Other subjects were to be added subsequently.
In 1989, when the National Curriculum was beginning to be introduced, three
separate development agencies were commissioned to develop their own model of
standard assessment tasks for seven-year-olds (known at the time as SATs) following
the blueprints given in the TGAT report. These SATs were to be large-scale, cross-
curricular tasks and break new ground in assessment. The three teams were:
. Consortium for Assessment and Testing in Schools (CATS), which comprised
the Institute of Education, London East Anglian Group for GCSE and
Hodder & Stoughton Publishers;
. Standard Tests and Implementation Research (STAIR), based at Manchester
University and including several local education authorities;
. NFER/BGC, which comprised NFER, Bishop Grosseteste College, Lincoln,
NFER-Nelson Publishing Company, and West Sussex and Sheffield Local
Education Authorities (LEAs).
In order to attempt to deal with the vast number of assessments that had to be made,
the three agencies adopted different strategies. CATS assessed every attainment target,
but not every statement. NFER sampled the attainment targets but attempted to assess
every statement within those selected. STAIR made a valiant attempt to assess every
statement of each attainment target, producing small mountains of paper-based tasks.
All the agencies ran trials in 1990, which demonstrated that none of the models was in
any way manageable, and so SEAC produced a new specification for the first full pilot
in 1991, asking all three agencies to re-tender against this. The NFER consortium was
successful and proceeded to the first full implementation. The new specification
required schools and teachers to assess a selection of attainment targets, some of which
were compulsory and some which could be selected from a range of options.
This early development illustrated well the tensions between: authenticity in the
assessment as representing the curriculum (and how children respond to it); the
requirements of accountability and reliability; and manageability in terms of
teachers’ time, their classroom control and children’s time (Sainsbury 1996). The
initial statutory attempt to meet these tensions was the 1991 SATs. These tasks were
designed to reflect quality work in infant classrooms and featured assessments of
investigations taking place with small groups of children. Reading was assessed
through an individual interview in which a passage from one of 27 specified books
had to be used. For the assessment of the attainment targets, each statement of
attainment was interpreted in the context of the specific activity, giving a basis for
teachers to make their judgements of children. In order to reduce the number of
assessments needing to be made, a crude form of tailored testing operated. When the
first assessment took place in primary schools across the country, there was a great
outcry about the assessment load. The work involved class teachers for too long – an
average of around 44 hours whereas the specification had been for 30 hours of
teacher time. Children’s education was disrupted for far too long. The reliability of
the SATs was criticised, since many of the tasks involved individualistic interactions
with children, and also because teachers found it difficult to assess children working
collaboratively. The SATs had attempted to meet some of the principles of sound
classroom-based educational assessment by providing a means whereby all children
could show their best, but this was unmanageable in a mass testing system.
However, there were positive outcomes, which were largely associated with the
effects on the curriculum and on teachers’ knowledge and skills in assessment. For
the first time ever, scientific and mathematical investigations were taking place in
every infant classroom and teachers were involved in making assessments of
children’s work in these areas. LEA advisers claimed that they had striven for years
to achieve this and had not succeeded. Compulsory authentic assessment had an
immediate effect. But for many, the gains in beneficial effects on the curriculum were
outweighed by the manageability and reliability arguments (Gipps et al. 1995).
The key stage 1 experience had proved very difficult, but it might have succeeded.
The primary teachers involved were coming to grips with the nature of the tasks and
were generally willing to try. The same was not the case for the next key stage to
come on stream: the testing in secondary schools of 14-year-olds. The same model of
development was followed, with a number of different agencies following differing
models, but still classroom tasks as advocated by TGAT. However, a series of crises
changed all that.
Key stage 3 and the crisis
In 1990, Kenneth Clarke became secretary of state6 for education in the final few
weeks of Margaret Thatcher’s government. He was reappointed by John Major, the
new (still Conservative) prime minister, and began to engage with the issues of the
curriculum and its assessment. Clarke liked to be regarded as a plain-speaking man,
and on being shown the pilot key stage 3 assessment materials, he publicly
denounced them as ‘elaborate nonsense’. The specifications were rapidly rewritten,
the contracts of the development agencies terminated and policy moved to the use of
‘written terminal examinations’ rather than the SATs.7 In English, the new agency,
Northern Examinations Association, lasted less than a year before it too was
replaced, this time by the agency that would undertake the work for over a decade,
the University of Cambridge Local Examination Syndicate (UCLES).
The new development agencies attempted to develop the new instruments very
quickly, but were hampered by being caught in a no-man’s-land between formal
testing and the criterion-referenced nature of the curriculum. The items within the
tests remained tied to the statements of the attainment targets and elaborate rules
were needed to aggregate the item results to the award of a level. These created
anomalies and complexities that endangered the reliability of the results awarded
(Ruddock and Tomlins 1993).
The real battleground though was the tests of English at key stage 3. This was
perhaps the first overt clash between the educational and the political layers. With
hindsight, a crisis was unavoidable, given a combination of hubristic political
involvement, the competing statutory agencies responsible (one for curriculum and
another for assessment), the lack of continuity of development agencies and the
speed of implementation. The crisis had the inevitability of a Shakespearian tragedy
and it was perhaps fitting that Shakespeare’s plays should be the centre of events.
Secondary English teachers proved to be the least pliant members of their
profession and the decisions of yet another Secretary of State, John Patten, enraged
them further. The 1993 tests were to include the curriculum subject of English for the
first time, but unlike science and mathematics, there had been no large pilot of the
English tests in 1992. English teachers did not know what to expect and began to
agitate. Patten then announced that there would be a compulsory test on one of three
specified Shakespeare plays. While the teaching and testing of Shakespeare was
objected to in principle by some, the majority of English teachers felt it could not be
fair without a reasonable time to prepare, which the late announcement did not
allow.
Pressure for a boycott of the tests grew, and the teachers’ unions responded by
balloting their members on the issue. In the National Union of Teachers ballot,
over 90% said they would support a boycott. Other teachers’ unions joined the
revolt but the inept Patten and the head of SEAC, Lord Griffiths, stood firm. Lord
Griffiths even declared that the tests were the best prepared in the history of
education, a statement manifestly untrue. Nevertheless, the government did fight
back, through a tame Local Education Authority, Wandsworth. This took the
form of legal action in the High Court against a teachers’ union, the National
Association of Schoolmasters/Union of Women Teachers (NAS/UWT), seeking to
prevent the boycott of the tests. The union, however, won the case, not on the
grounds of the underprepared tests or their educational value, but on the grounds
of its members’ workload. Teachers could not be required to undertake the extra
unpaid work of marking the tests. The government’s policy was in disarray and, to
all intents and purposes, the 1993 tests did not take place. This legal decision had
important consequences for the later structure of NCA and, unknowingly, for
future crises.
A white knight was required to rescue the system, and this took the form of Sir
Ron Dearing. The Dearing review streamlined the curriculum, decided (after some
consideration) to retain the level-based system but reduced this to eight levels, and
reduced markedly the requirements for record keeping in support of teacher
assessment, which was retained and granted a notional equal status with the test
results. The tests themselves were streamlined and changed in focus from ostensibly
criterion-referenced instruments to mark-based tests. The need for standard setting
and maintenance procedures, which became major issues later, came about as a side
effect of this. Finally, the tests were in future to be externally marked, resolving the
workload issue of the Wandsworth case (Dearing 1994). This response, forced by the
teachers, paradoxically took the system further from one of the original educational
objectives of TGAT, that of teachers marking their own students' work and
deriving useful feedback from the process.
In general terms, in the five years after 1991, the balance between validity,
manageability and reliability gradually (and at times rapidly) swung away from
validity toward manageability and reliability. This became inevitable once the
teachers’ unions, which organised a boycott of the assessments in 1993, had
determined that the paramount issue was manageability. The use of the results for
accountability purposes – the publication of school results from 1996 – also pushed
in the direction of reliability through standardisation. The advocates of curriculum
authenticity could not stand against these arguments. The trend was therefore
toward a reduction in the number of subjects tested (science was no longer included
for seven-year-olds) and in the aspects involved (any attainment targets involving
practical elements were generally excluded).
In response to widespread criticism, the curriculum had been reviewed and reduced in size and structure as part of the Dearing reforms (Dearing 1994). The attainment targets themselves were altered in structure, from statements of attainment to a more broadly based hierarchy of level descriptions, against which the teacher was to find a best fit to the child’s attainments. The number of levels was reduced from ten to eight. It was a model that based assessment on direct global judgement rather than on separate elements that then had to be aggregated. As such, it was well suited to teacher assessment, but test developers generally had to redefine elements explicitly in order to arrive at questions for the National Curriculum tests.
An interesting question to ponder is why the dangers of criterion referencing were
not recognised at the outset of the National Curriculum. Some aspects were, in that
the dangers of minimum competency tests were avoided with the progressive level-
based system. However, the problems of large numbers of criterion-referenced
statements were ignored. This did not seem to influence either TGAT or the subject
working groups. By the time test developers came on the scene, the statements of
attainment were established facts and the direction given to development agencies
was that these must be individually assessed. It should not have taken the expensive lessons of 1993 to realise the impossibility of the early curriculum and assessment model.
The Dearing reforms effectively set the form of the NCA system in place for the
next 12 years. There have been other perturbations and flurries of political interest,
but these were very minor compared with the 1993 crisis. If NCA has a hero, it has to
be Ron Dearing, and a grateful government ennobled him in 1998 for this and other
services to education.
Key stage 2 tests
In contrast to the key stage 1 and key stage 3 tests, those at key stage 2 did not go through an initial stage of open-ended tasks followed by changes of specification and style. By the time the development of the tests for 11-year-olds began in 1994, the
specification was firmly for timed written tests. These were introduced in
mathematics, science and English. After initial uncertainty, reading and writing
were separately reported but a combined English level was also calculated. In all
subjects, the suite of tests had assessment tasks for levels 1 and 2 (which became
optional from 1997), then written tests for levels 3 to 5, which covered the majority
of children. Initially there was an additional level 6 test, requested by ministers to
encourage gifted pupils, but this was dropped after 2002. Otherwise, in general, the
tests remained similar across the years. For mathematics, a test of mental arithmetic
was introduced in 1998.
In broad terms, judged as written tests of a part of the curriculum, the key stage 2
tests are good examples, which stand up to scrutiny (Rose 1999). However, there are
aspects of the curriculum that are formally omitted, such as speaking and listening in
English, investigations in mathematics and practical investigative science. As such,
they are only a partial measure of the full subject curriculum.
The making of the tests
As outlined above, political expediency brought the tests into being, but somehow their production had to be managed and delivered. This was hardly discussed by TGAT, which covered it in one paragraph, without making any recommendations.
Starting from nothing, a system had to be created to develop tests for three age
groups in several subjects, get these printed for about 1.8 million children, delivered
to about 25,000 schools, collected back again, marked reliably and then the results
collected, collated and published. Stated in these terms, it can be seen to be a
considerable logistical exercise, of the sort that has many possibilities for going
wrong. That this happened each year until 2008, generally without large disasters (although with some small ones), can be seen as a success for those involved. Whatever the
views about the quality or effects of the tests, the system has largely worked. This is
all the more remarkable when the number of organisations involved is considered.
In large part, the chosen method of managing this complex system has been to
divide it into elements and to contract agencies expert in the particular functions to
undertake these, with the statutory body (SEAC, SCAA or QCA sequentially)
overseeing the whole process. Generally, the contracts have been for a span of three
or four years but there have been instances of earlier terminations or longer
contracts. It would be both difficult and tedious to describe the whole process here, and there are few sources that describe its full complexity.
The test development process of 1999 is described in Rose (1999, Appendix 3)
and the current procedure is available on QCA’s website.8
Broadly speaking, the development process follows a classical pattern. In
simplified terms, items (questions) are written by subject matter experts, then formed
into several trial versions, which are administered to substantial numbers of students
with the responses being analysed through a variety of statistical item analysis
methods. A final version of the test (and a reserve test) is constructed from this
information and, in a second pre-test a year ahead of its actual use, the test is
administered in secure conditions to students at about the same time that they take
the current year’s test. Some students also take an unchanging anchor test. These
processes are designed to enable the new tests to be equated to the standards of
previous years. There are also a considerable number of review processes from
curriculum experts and other specialists in equality and diversity. Special
arrangements and modifications are produced for pupils with various special needs
taking the tests and these are described in Modified test administrators’ guides
available from QCA each year.
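The ‘statistical item analysis methods’ mentioned above can be illustrated with the two most basic classical statistics: an item’s facility (the proportion of pupils answering correctly) and its point-biserial discrimination (the correlation between success on the item and the total score). The Python sketch below uses invented response data and is a minimal illustration of the technique, not the development agencies’ actual procedure.

```python
# Classical item analysis on a small hypothetical trial: rows are pupils,
# columns are items, 1 = correct, 0 = incorrect; illustrative data only.

def item_analysis(responses):
    """Return, for each item, (facility, point-biserial discrimination)."""
    n_pupils = len(responses)
    totals = [sum(row) for row in responses]              # pupils' total scores
    mean_total = sum(totals) / n_pupils
    sd_total = (sum((t - mean_total) ** 2 for t in totals) / n_pupils) ** 0.5
    stats = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        n_correct = sum(item)
        p = n_correct / n_pupils                          # facility: proportion correct
        if 0 < p < 1 and sd_total > 0:
            # point-biserial: correlation between item score and total score
            mean_correct = sum(t for t, x in zip(totals, item) if x) / n_correct
            r_pb = ((mean_correct - mean_total) / sd_total) * (p / (1 - p)) ** 0.5
        else:
            r_pb = 0.0                                    # undefined for uniform items
        stats.append((p, r_pb))
    return stats
```

Items with very low or very high facility, or with weak discrimination, would be revised or dropped before the final test form is assembled.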
The agencies involved in test development have been university departments,
examination boards, research institutes and, for a period, QCA itself. There has been
a narrowing of suppliers over the years, with academic institutes largely withdrawing
and only specialist assessment agencies remaining. This can probably be seen as a
response to the increasing stringency of the contracts, which came to contain many
penalties and other legal requirements, and to the difficulty of aspects of the
development process.
In an early sign of this narrowing of suppliers, in 1998, when the contracts for the
2000–03 tests were being let, QCA decided to take the mathematics development
work in house. It recruited and created an internal team to undertake the
development of the mathematics tests for all three key stages. However, the
quinquennial review of QCA in 2002 pointed to the anomaly of the same
organisation (and sometimes same individuals) overseeing the tests, developing the
tests and being responsible for regulating them. The review recommended that QCA
should consider the case for removing this responsibility for delivering the tests from
its remit and advise ministers on this. As a response, in 2004, the QCA mathematics test development team was outsourced to Pearson Education, a commercial company, as a going concern under the governmental Transfer of Undertakings (Protection of Employment) (TUPE) arrangements.
In 2004, in a further response, QCA also created an internal department, the
National Assessment Agency (NAA), which oversees the development and delivery
process for the tests, separate from its regulatory department. Indeed, in 2008, the
separation process is being taken further as the regulatory function is transferred to a
totally new agency, Ofqual (Office of the Qualifications and Examinations
Regulator).
From 1995, for key stages 2 and 3, the tests were externally marked as was
suggested in the 1993 curriculum review and forced by the legal action of NAS/
UWT. As with test development, this has been undertaken by external agencies. The
resources required for this exercise have meant that it has been conducted by large examination authorities: initially AQA, then Edexcel and, most recently, ETS, a large US organisation which (through its European subsidiary, ETS Europe) was appointed from
2008 to 2012. The first year of this contract produced a crisis of delivery, which led to
many schools’ tests not being marked by the due date. This resulted in the
termination of the ETS contract and a major embarrassment for the government,
leading in part to a reconsideration of the nature of NCA, which is described in the
final section.
The marking process has, perhaps surprisingly, remained fairly constant over the
period. Marking is undertaken over a period of a few weeks by current or past
teachers, generally in their own homes. Markers receive training and there are checks
on their accuracy through re-marking of samples of their work. A complex system of
chief markers, their deputies and team leaders has operated in a slightly changing fashion, but with the same underlying principle: that a ‘true standard’ lies in the head of the chief marker, and that others must try to match it.
An evaluation of the first year of marking (SCAA 1995) showed it to be well conducted and good value for money, but doubts have gradually grown as the importance of the results to schools has escalated. This has been seen most clearly in the
numbers of requests for remarking or review, which have steadily increased.
The National Curriculum testing system was an early adopter of openness in that
the marked papers are returned to schools for scrutiny. Newton and Whetton (2005)
describe the paradoxes and conflicts that arise from this. In principle, an appeals
system such as this is designed to add confidence to the system since it allows
checking and the correction of errors. However, the disclosure of those errors instead
leads to a reduction of confidence in the system, rather than a recognition of the
importance of the safety net.
It is perhaps surprising that, as yet, the use of modern technology has not become
more prevalent in the marking process. On-screen marking of the tests, either in
centres or in a distributed manner by markers working at home, is technically
feasible and no longer leading-edge. In relation to this, however, the government and its agencies have been risk-averse. This is apparently due to the
importance they attach to the security of publishing the results on time and
memories of two previous embarrassments, the failed attempt in 1998 to gather data
from scoring sheets and the late delivery of key stage 3 English test results to schools
in 2004. The latter occurred after a change in the specification involving a switch to
component marking, such that reading and writing were marked separately and then
the results recombined. QCA’s Board undertook its own investigation into this and
produced a damning report (Beasley et al. 2004) stating that poor management
decisions and operational inefficiencies led to delays in getting papers or results to
schools.
There have, however, been investigations and pilots of the introduction of marking technology for national curriculum tests, but evaluation of these did not demonstrate that the systems were sufficiently robust to proceed to a full-scale implementation (Whetton and Newton 2002).
At the time of writing, the delivery failure of marking in 2008 is still being
investigated but its reasons and consequences will continue to have reverberations in
future years.
Labour’s changes
In 1997, after 17 years of Conservative government, there was a landslide victory for
the Labour party. Famously, in one of his first speeches, the new prime minister,
Tony Blair, declared that he had three priorities: ‘education, education, education’.
This was part of an ambition to make Britain a better place. The Labour manifesto
of 1997 said ‘I believe Britain can and must be better: better schools, better hospitals,
better ways of tackling crime, of building a modern welfare state, of equipping
ourselves for a new world economy’ (quoted in Marr 2007, 512).
The problem became one of translating these general statements into a reality.
Certainly in education, the new government enthusiastically set about action. This
consisted of a stream of initiatives covering all aspects of school life. There was early success with the literacy and numeracy strategies, which were perceived as effective because of increases in the proportions of children achieving literacy targets (though these increases were later questioned; see below). The measure of the
success of these initiatives was often the outcomes of the national curriculum tests.
For schools, this meant that the publication of their own results came to have
greater and greater significance, but ministers also embraced the target culture,
setting targets for the country as a whole in terms of the proportion of students
attaining the expected level for their age in the national curriculum tests. Associated
with this, there was a subtle but important change in the meaning of the levels. TGAT and the early groups had defined the benchmark level 4 as being the
expected attainment of the average 11-year-old, but the target setting culture
amplified this to be the level all students were expected to achieve.
This heightening of the stakes of the national curriculum tests was illustrated by
events in 1999. A newspaper article in the Daily Telegraph, a paper opposed to New
Labour, alleged that the cut-scores for the national curriculum tests in reading were
being deliberately lowered in order that the government’s targets could be met. This
was taken up by other parts of the media, so in a classical political damage limitation
manoeuvre, the secretary of state for education, David Blunkett, set up an independent
inquiry, headed by Jim Rose, deputy chief inspector at Ofsted, the schools inspectorate.
In his (audio-taped) memoirs, Blunkett (2006, 125) describes this as:
a silly story about whether the Key Stage 2 results on literacy and numeracy have been affected by the chopping and changing of the examiners, who had originally set easy questions, realised the mistake in time and withdrew all the questions they considered too easy – then ended up with tests that were so difficult compared with previous years that there are accusations flying around from all angles.
This quotation illustrates the lack of understanding of assessment processes involved
in test development and standard setting, which is symptomatic of such debate, but
also the political importance of these processes.
In fact, the procedures for maintaining the cut-scores at the same level of ability and knowledge for tests which change each year and are made public have been a constant source of difficulty for QCA and the developers involved. TGAT hardly
addressed this issue at all, assuming that criterion-referencing would be the solution:
‘Only by criterion-referencing can standards be monitored’ (DES and WO 1988,
para. 222). But from 1996 onwards, when the tests became more usual mark-based
instruments, the maintenance of a constant standard became an issue, which then
increased in importance as the stakes of the tests results rose politically and they
became measures of governmental success.
The methods used have evolved over the years and have also differed between the
subjects and key stages. In part, this was because of the different development
agencies involved, and in part, the differing nature of the subjects and their
components. Writing, for example, lends itself less well to item-based equating
methods. The types of processes that have been adopted fall into two groups – the
first are judgemental processes and the second statistical equating exercises.
In the early years, greater credence was placed on judgemental methods, and
variants of Angoff procedures (Angoff 1971) were often used. In these, teachers or
other curriculum experts scrutinised the test items closely and made a judgement
about the proportion of students at a level threshold who would answer correctly.
Later this was sometimes replaced by bookmarking processes (Mitzel et al. 2001).
These judgemental methods based on items alone have gradually declined in use, as they have proved less useful and less consistent with other information. However,
the alternative judgemental method of script scrutiny continues to be an important
part of the process. In this, a selection of completed scripts from the tests’ actual use is scrutinised individually by a group of experienced markers and compared to the
scripts from previous years. This leads to a combined judgement of where the cut-
scores for each level should be.
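In outline, a basic Angoff exercise reduces to simple arithmetic. The Python sketch below (with invented panel figures; it illustrates the unmodified procedure, not any agency’s actual implementation) sums each judge’s item-level probability estimates to obtain that judge’s implied raw cut-score, then averages across the panel.

```python
# Basic Angoff calculation: each judge estimates, for every item, the
# probability that a pupil on the borderline of a level answers correctly.
# Summing one judge's estimates gives that judge's implied raw cut-score;
# the panel's recommendation is the mean across judges.
# All figures below are invented for illustration.

def angoff_cut_score(judgements):
    """judgements: one list of per-item probabilities per judge."""
    per_judge = [sum(probs) for probs in judgements]
    return sum(per_judge) / len(per_judge)

panel = [
    [0.9, 0.7, 0.4, 0.6],  # judge 1's estimates for four items
    [0.8, 0.6, 0.5, 0.5],  # judge 2
    [0.7, 0.8, 0.3, 0.6],  # judge 3
]
cut = angoff_cut_score(panel)  # panel's recommended cut-score in raw marks
```

In practice the panel would discuss discrepant estimates and often revise them over several rounds before the final average is taken.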
Some form of statistical equating has been used consistently each year for the
national curriculum tests. The development process for the tests is such that each test
is available a year in advance of its use. It is then given (in a secure fashion) to
students who are about to take the current year’s test. The two sets of results can
then be equated, either using well-established whole-test methods (Angoff 1971) or
through item response theory methods (Holland and Rubin 1982). An issue in these
processes has been what has come to be known as the pre-test effect. It became
apparent that children’s motivation and knowledge were less when taking the low-
stakes equating test than when doing the actual test (Pyle et al. 2009, this issue). This
effect had then to be allowed for judgementally in the decision process. In response
to this difficulty, a variant on equating has been the use of an anchor test, i.e. a test in
the subject area that is used over a number of years and equated each year with the
new test. This has the advantage of checking accumulated drift from the year-on-
year equating and of avoiding the pre-test effect, since both tests are taken under
low-stakes conditions.
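The equating step can be illustrated with the simplest such technique, linear (mean-sigma) equating. The sketch below uses invented score distributions and glosses over the anchor-test machinery that matches the two pupil groups; it is a deliberately simplified stand-in for the whole-test and item response theory methods cited above.

```python
# Linear (mean-sigma) equating between two test forms, with invented data.
# The anchor-test matching of the two groups is assumed, not implemented.

def mean_sd(xs):
    m = sum(xs) / len(xs)
    return m, (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

def linear_equate(new_scores, old_scores):
    """Return a function mapping a mark on the new test onto the old test's
    scale, so that the two score distributions share mean and spread."""
    m_new, s_new = mean_sd(new_scores)
    m_old, s_old = mean_sd(old_scores)
    return lambda x: m_old + (s_old / s_new) * (x - m_new)

# Hypothetical pre-test results for comparable groups on the two forms:
old = [40, 45, 50, 55, 60]
new = [35, 40, 45, 50, 55]        # the new form looks about 5 marks harder
to_old = linear_equate(new, old)
# to_old(45) == 50: last year's cut-score of 50 maps to 45 on the new test
```

With this mapping, a cut-score fixed on last year’s scale can be carried onto the new form; the pre-test effect described above would then have to be allowed for judgementally.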
Ultimately, the Rose Inquiry reported that there had been no ‘fiddling’ of the
results to present improved literacy and numeracy outcomes. This finding was accepted, since Blunkett was a better politician than psychometrician and had included opposition appointees, a leading head teacher and the Times newspaper’s education correspondent on the inquiry panel.
However, the Rose Inquiry was largely into whether proper procedures had been
followed in setting and marking the 1999 tests. There was a more fundamental
question raised by the political importance of the targets as measures of change over
time. This was whether the use of these procedures over an extended period had
maintained the required standard at a constant level. Following the Rose report, this
question was addressed by a large-scale study undertaken by Massey et al. (2003)
and funded by QCA. This research was conducted in Northern Ireland, where Massey et al. administered the tests used in England in 1996 and in a later year (1999, 2000 or
2001) to the same students. The tests were key stage 1 reading and mathematics; key
stage 2 English, mathematics and science; key stage 3 English, mathematics and
science. Massey et al. acknowledge the methodological difficulties of their approach
but believe that they were as successful as possible in obtaining conditions fair to
both versions being compared, and conclude that overall between 1996 and 2000,
standards in the national testing system were maintained or became marginally
stricter more often than they could be challenged. Nevertheless, for one high-profile
component, key stage 2 English, the experimental evidence indicated that a
significant proportion of the apparent improvement in national results may have
arisen from variation in test standards.
Such a conclusion would be supported by Tymms (2004), who challenged the
validity of the rise in proportion of children achieving the target level 4 in the key
stage 2 tests. This was a response to the claims of the government advisor who had
been responsible for many of the educational initiatives of the new Labour government, first as chief adviser to the secretary of state for education on school standards and then in the policy office of the prime minister (Barber 2001).
Tymms considered the general pattern of changes, comparing the statutory test data with the results from other studies, and found clear rises in standards up to 2000, though not as strong as the official data suggested. He agreed that the small changes since 2000 were consistent across sources. Tymms went further and raised this issue with
the Statistics Commission. This was an independent public body, set up in June
2000 to ‘help ensure that official statistics are trustworthy and responsive to public
needs’. (It has since been subsumed into the UK Statistics Authority.) Tymms’
concern was that the statistics used – key stage 2 test scores – were not suitable for
the purpose of monitoring trends in standards over a period of years. The
Statistics Commission (2005) concluded ‘that it has been established that
(a) the improvement in key stage 2 test scores between 1995 and 2000 substantially
overstates the improvement in standards in English primary schools over that
period, but (b) there was nevertheless some rise in standards.’ However, it also
concluded that ‘we are aware of no particular fault with the procedures that QCA
now follow for maintaining test standards over time’ and further that ‘test scores
may not be an ideal measure of standards over time, but it does not follow that
they are a completely unsuitable measure for a [government] target. There is no
real alternative at present to using statutory test scores for setting targets for
aggregate standards.’
A first lessening in the testing regime occurred in 2005. As public pressure
mounted on ministers over the effects of testing, one strand of complaint was
concerned with the effects on children. It was apparent that formal testing of children at the ages of 11 and 14 was acceptable to the public (if not to the profession). However, there were greater public concerns over the pressures on young children arising from being subjected to formal tests. In response, in 2004,
QCA organised a trial of a new approach to the key stage 1 assessment in which
more emphasis was placed on teacher assessment and the final judgement was that of
the teacher. Schools had, though, first to use nationally provided tests or tasks, but at
a time of their own choice. The trial was evaluated by Leeds University (Shorrocks-
Taylor et al. 2004) and judged successful, so that from 2005, the new system was
adopted for all schools.
This was the first time since the early days of the testing regime that the testing
arrangements had been lightened, and the first return to teachers’ assessment taking
precedence over tests.
It was perhaps significant that this was for the key stage 1 arrangements, where
the tests served less of an accountability function, but this could nevertheless be seen
as something of a possible model for the future of NCA.
However, less optimistically, an NFER evaluation of the arrangements a year
later struck a note of caution. Only a minority of teachers were taking advantage of
the flexibility of the new arrangements; the majority were using the same tasks/tests
as previously and continuing to administer them during the summer term. There
were indications, however, that teachers intended to be more flexible, in choice of
tasks/tests and timing, in future (Reed and Lewis 2005).
Apart from this, from 2000 through to 2008, the NCA system was relatively
unchanging. However, this is not to say it was without criticism. These criticisms fall
into several groups:
– that the system has developed too many purposes, so that it cannot adequately serve them all (Newton 2007 identifies 18 ways in which the test results are being used);
– that the accountability function puts too much pressure on schools;
– that the accountability function and the nature of the tests lead to a narrowing of the curriculum;
– that the tests put too much pressure on children;
– that standards would be raised to a greater degree (and more widely and validly) through an emphasis on ‘Assessment for Learning’.
During 2007 and 2008, these issues were examined in two different arenas: the
academic and the political. Under Robin Alexander, the Primary Review based at
Cambridge University examined the condition and future of primary education in
England. The Primary Review commissioned academic surveys on all aspects of
primary education, including the assessment system. The paper dealing with this
concluded:
The current system of assessment in England provides information of only low dependability whilst having some negative impacts on teaching and learning. Alternative systems need to be considered. One in which summative assessment is based on teachers’ judgements would provide information that is more valid than tests and at least as reliable, but it would be necessary to avoid high stakes being attached to the results by not using them for purposes other than reporting on individual pupils. For national monitoring, a regular sample survey, using a large bank of items, would give far more information than is provided by results of individual pupils who have all taken the same test. (Harlen 2007)
Other reviews by Wyse, McCreery and Torrance (2008) on the impact of national
reforms in curriculum and assessment on English primary schools, and Tymms and
Merrell (2007) on standards and quality in English primary schools over time offered
similar conclusions, from their own perspectives.
These surveys provided a powerful academic case against the current form of
national assessment system, but their critical tone has been perceived by government
as unhelpful, rather than constructive, which may have limited their immediate
influence, although their longer-term effects remain to be seen.
The other major academic input has been the promotion of Assessment for
Learning, which has had rather more success. The driver of this success was a small
number of academics acting together as a modern single-issue pressure group under
the banner of the Assessment Reform Group (Daugherty 2007). The starting point
was from the work of Paul Black and Dylan Wiliam (1998), which demonstrated that
improving formative assessment raises standards. Subsequent work was devoted to
clarifying the definition of Assessment for Learning and in proselytising on its
behalf. The outputs of the group have ranged from summary leaflets for teachers to a
full exposition of the practice and policy of using assessment to help and report
learning (Gardner 2006). The activities of the group have been extremely influential
and the phrase ‘Assessment for Learning’ has been widely adopted including by
government and its agencies. This culminated in June 2008, when the government
announced spending of £150 million on a programme calling itself ‘The Assessment
for Learning Strategy’ (DCSF 2008). However, as is often the case with a widespread
adoption of carefully nuanced principles, the mass messages may not have been what
the originators wished for.
The second, political, examination of NCA was that of the Children, Schools
and Families Committee of the House of Commons, which undertook an enquiry
into Testing and Assessment (GB. Parliament. HoC. CSF Committee 2008). Both
the standards strand and the Assessment for Learning strand figure strongly in that
report. In slight contrast to the academics’ general hostility to national testing, the committee concluded that:
We consider that the weight of evidence in favour of the need for a system of national testing is persuasive and we are content that the principle of national testing is sound. Appropriate testing can help to ensure that teachers focus on achievement and often that has meant excellent teaching, which is very welcome. (Para. 25)
However, echoing other views, the report went on to criticise the current system and
urge the government to find an alternative:
National test results are now used for a wide variety of purposes across many different levels – national, local, institutional and individual. Each of these purposes may be legitimate in its own right, but the question we have asked is whether the current national testing system is a valid means by which to achieve these purposes. We conclude that, in some cases, it is not. In particular, we find that the use of national test results for the purpose of school accountability has resulted in some schools emphasising the maximisation of test results at the expense of a more rounded education for their pupils. A variety of classroom practices aimed at improving test results has distorted the education of some children . . . We find that ‘teaching to the test’ and narrowing of the taught curriculum are widespread phenomena in schools, resulting in a disproportionate focus on the ‘core’ subjects of English, mathematics and science and, in particular, on those aspects of these subjects which are likely to be tested in an examination . . . We conclude that the national testing system should be reformed to decouple these multiple purposes in such a way as to remove from schools the imperative to pursue test results at all costs. (Summary)
Even earlier than the parliamentary committee report, from 2007, there seems to
have been a growing acceptance that the NCA system should change. For the
government, this had two motivations: to attempt to reduce the criticism of the
testing system from professionals and the media, and to reinvigorate attempts to
improve standards that they saw as having stalled, as indicated by the small increases
in proportions reaching the accountability targets from 2000 onwards.
The proposed solution was announced by Alan Johnson, the secretary of state for
education, in January 2007. This was known as the Making Good Progress pilot and
within this, an assessment system based on teacher assessment and ‘single level tests’. As ever, this was a political decision taken in haste, without consultation or any detailed planning. Rather, the broad principles were announced and the problem handed over to officials and statutory bodies, who were charged with providing a workable system.
Making Good Progress (DfES 2007) stated that the government was interested in
‘exploring the impact of enabling teachers to enter a pupil for an externally marked
test as soon as they are confident (through their own systematic assessments) that the
pupil has progressed to the next level.’ The proposal went on to say that the tests
would be offered to schools at two points in the year and pupils could be entered
individually for a test that marks success at one level, and stimulates progress
towards the next level. They were to be no more burdensome than the current end-of-key-stage ‘multi-level’ tests and would:
generate the data on achievement that is so important for school accountability. The system would be a one-way ratchet: once a pupil has passed a level, they will never go back, only forward. The model could be a powerful driver for progression, raising expectations for all pupils, motivating them, bringing a sharp focus on ‘next steps’ and perhaps especially benefiting those who start the key stage with lower attainment than their peers, or are currently making too little progress. Ultimately, these tests might replace end of key stage arrangements.
Since that announcement, the pilot has begun, the tests have begun to be developed
and, as for the SATs in the early days of the national curriculum, the model is being
refined in an attempt to make a broad conception into a working system. It remains
to be seen whether this can be achieved.
It is important, though, to recognise what was not being given up here. Despite
the criticism from many of those within education of the distorting effects of
accountability, this remains a part of the proposals for the future. Similarly, the use
of short written tests would retain the possibility of a narrowing of the curriculum in
these subjects to the aspects being tested. In broad terms, what was being proposed
was not a change of strategy but a change of tactics.
This change of tactics took a further sudden step in October 2008, when (yet
another) secretary of state for schools, Ed Balls, unexpectedly announced the
cessation of National Curriculum testing at key stage 3, with immediate effect. This
included both the existing key stage tests and the single level tests, which would
continue to be piloted for key stage 2 only.
In making this announcement, the Secretary of State set out the government’s
view of three key principles for the continuation of the assessment system. He said it
should:
– First, give parents the information they need to compare different schools,
choose the right school for their child and then track their child’s progress;
– Second, enable head teachers and teachers to secure the progress of every child
and their school as a whole, without unnecessary burdens or bureaucracy;
– Third, allow the public to hold national and local government and governing
bodies to account for the performance of schools.
In the light of these principles, he stated that he had concluded that the key stage 3
testing was not justified. Parents could obtain information from GCSE results and a
new system of ‘real-time reporting of progress’ would be developed for tracking
individual pupils. In addition, there would also be an externally marked test with a
sample of pupils to measure national performance, holding the government to
account. The details of the new elements are yet to be determined, and an expert
group was set up to advise on them.
It is always difficult to predict the future, but it does seem that NCA in England is
about to change further, beyond even this announcement. It is not easy, though, to
predict what this change will be. The parliamentary committee accepted that there
should be some system of national testing, but wished it to be reformed. At present,
the government retains its wish to have an accountability measure at key stage 2, both
because of long memories of the lack of control in primary education before the
Education Reform Act and because target setting currently remains the manner in
which central government manages large state enterprises. It may be that an
alternative philosophy of government emerges, but this is not apparent at present.
Indeed, the key stage 3 announcement reinforced the belief in the current
philosophy.
It is a reasonable question to ask whether NCA has been successful, but like many
simple questions, there is no single straightforward answer. A first clarification needs
to be: successful in terms of what criteria? NCA can be considered from a political
viewpoint, an educational viewpoint, an assessment viewpoint and even as an
operational process.
From a political viewpoint, overall the system can generally be seen as a success,
with some caveats. The original reasons for central government intervention in
education in the 1980s were to bring a measure of accountability to schools, to
control the curriculum and to improve standards of achievement. It is now the case
that schools acknowledge their responsibilities to their students, to parents and to
the community at large to a far greater extent than before. In general, they also
accept the need for some accountability measures, although perhaps not the current
ones. This has not come about only through NCA and the publication of results, but
also through an inspection system, originally frequent and stringent, and measures to
step in and manage or close failing schools. The testing system has contributed to
curriculum implementation and control through the concentration on English,
mathematics and science, ensuring their emphasis in schools. It has also helped to
give priority to particular subject elements from time to time. It is seen as having
raised standards (as elaborated in the government evidence to the Select Committee),
though the extent of this is disputed by others (Statistics Commission 2005). More
negatively from a political viewpoint, there was the correction to the process in the
1990s following the first introduction, but the political interpretation is that this was
a necessary sorting-out of the over-complexity created by the educational
professionals. Finally, a measure of success is that until 2008, there was no
widespread public demand for an end to the system, the clamour mostly arising from
educational professionals, who could be considered to be producer interests. The
failure of delivery of the marking agency in 2008 increased criticism of the system
and led to other aspects, such as marker reliability, being challenged. Hence, the
period of political success is coming to an end, and a new phase is required to
maintain the continuing objective of further improvement, without losing the other
aims of control and accountability.
From an educational viewpoint, the verdict is rather different – see the papers of the
Cambridge Review: Harlen (2007), Wyse, McCreery and Torrance (2008), and Tymms
and Merrell (2007). The pressure placed on schools through the accountability
mechanisms is considered counter-productive and the curriculum has been narrowed
both in terms of subjects studied and the nature of the learning process. The teaching
to the tests may consist of coaching in test-taking skills and emphasising only the
content of the tests, rather than the subject more broadly. This is regarded as
fostering ‘shallow learning’ rather than ‘deep learning’ (Gipps 1994, 23–4). It is also
believed that the momentum of raising standards would have been greater through
investment in formative or classroom assessment rather than a testing regime. In
summary then, the view of educational academics and teachers’ representatives is that
NCA has not been an educational success, and other lighter-touch mechanisms
would have been and are preferable.
Judged from an assessment viewpoint, the emphasis must be on validity with
reliability encompassed within this. Stobart (2009, this issue) and Newton (2009,
this issue) give their verdicts on these. The problem with arriving at an overall view
is that validity should arise from a clear statement of purpose, and this has been
difficult to determine for NCA. The original broad purposes of TGAT have been
supplemented by many others over the years. Newton (2007) identifies 18 different
purposes of the NCA system and the evidence to the Select Committee located
several more. Hence, the system as a whole meets these purposes to varying
extents, but because of their number achieves none perfectly. The key stage 3
announcement in October 2008 was a clear statement of the government’s priorities
among the many purposes. From the narrow viewpoint of the tests themselves,
there were 15 years in which a new style of high-quality curriculum-related testing
was established and many problems addressed, bringing test development in the
UK up to date.
Finally, there is the organisational element. Certainly for QCA a successful year
has been considered to be one in which the tests are delivered to schools, collected
back, marked and the results collated and returned, to time, without loss or
complaint. In general this has happened – the complaints have not been
unmanageable and the losses relatively few. The events of 2008, with a failure of
delivery of the marking contract, though recent, are the exception. From time to time,
there have been economic evaluations of the system and in general terms these tend
to show that the operations represent value for money. The costs are around £40
million per year or just over £2 per child tested. The average cost per school is
around £4000 per secondary school and £1000 per primary school. This could be
seen as good value for the effects on the system (if these are considered beneficial)
and can bear comparison with the costs of promoting teacher assessment (£150
million over three years announced in 2008) or moderation systems. Certainly, it
would be difficult to have a similar effect throughout a large system such as that in
England through other means for the same expenditure.
An overall evaluation of NCA has therefore to be mixed, since it has not
satisfied all stakeholders and any view of it depends on the viewpoint. It is a
system that has not been so bad that it has failed or been completely abolished, but
neither has it been so good that it is widely acclaimed. As such it is like many of
the modern public institutions of the UK (or more narrowly in this case, England),
which are of such a size, such an organisation and such complexity that they are
difficult to change and even more difficult to improve in a manner that gives all
stakeholders what they want. It has come to have a momentum that gives it
staying power, and even though there is at present a desire for change, it cannot be
certain that any new system will be necessarily better and without its own
unforeseen consequences. The next few years will determine the nature of the
change, and the years beyond that, whether this will come to be regarded as an
improvement. But that is another, future history.
Acknowledgements
I am grateful for the assistance and comments of Marian Sainsbury, Jo-Anne Baird, Frances Brill and Felicity Fletcher-Campbell, who all provided helpful comments to improve and clarify the arguments of this article.
Notes
1. ‘Thatcherite’ is a term used to encapsulate the policies and beliefs of Margaret Thatcher, prime minister of the UK from 1979 to 1990. ‘Thatcherism’ is characterised by decreased state intervention, monetarist economic policy, privatisation of state-owned industries, lower taxation and opposition to trade unions. Paradoxically, the intervention in education increased state control. Thatcher was leader of the Conservative political party, traditionally the party of the right.
2. ‘Blairite’ is a term used to describe the beliefs or followers of Tony Blair, British prime minister 1997 to 2007. Blair’s policies are not easy to encapsulate but did include the promotion of quasi-markets into public services, including education. Blair was leader of the (New) Labour political party, traditionally the party of the left, though increasingly centrist under Blair.
3. Ruskin College is an independent educational institution in Oxford, founded to provide opportunities for working-class men and women, and with strong links to the Trade Union movement. Its choice for this speech was therefore a carefully considered political statement.
4. The title of the chief finance minister of the UK government.
5. A feature of the governmental organisation of the UK is the use of statutory bodies to implement change or regulate markets. Such bodies are nominally independent and outside government departments, but are nevertheless heavily controlled by them.
6. ‘Secretary of State’ is the title used in the UK for ministers of large government departments. Often, Secretaries of State have several junior ministers working under them. In the period described in this article, the Secretary of State responsible for schools had a variety of other responsibilities in addition, with these changing frequently.
7. From this point on, the term SAT has never been used in an official publication. Nevertheless, it has proved to be the most long-lasting element of TGAT, and remains in common (mis-used) parlance in public and press debates.
8. Page heading: Test development, level setting and maintaining standards. www.qca.org.uk/qca.42021.aspx (accessed 30 July 2008). However, QCA’s website constantly changes its URLs.
References
Angoff, W.H. 1971. Scales, norms and equivalent scores. In Educational measurement, ed. R.L. Thorndike, 508–600. Washington, DC: American Council on Education.
Baker, K. 1993. The turbulent years: My life in politics. London: Faber & Faber.
Barber, M., ed. 1996. The National Curriculum: A study in policy. Keele: Keele University Press.
Barber, M. 2001. The very big picture. School Effectiveness and School Improvement 12, no. 2: 213–28.
Barnett, C. 1986. The audit of war: The illusion and reality of Britain as a great nation. London: Macmillan.
Beasley, M., E. Gould, S. Kirkham, and H. Emery. 2004. Report on Key Stage 3 English review of service delivery failure 2003–2004 to QCA Board. London: QCA. http://www.qca.org.uk/libraryAssets/media/10343_ks3_en_report_04.pdf (accessed 31 July 2008).
Black, P. 1997. Whatever happened to TGAT? In Assessment vs evaluation, ed. Cedric Cullingford, 24–50. London: Cassell.
Black, P., and D. Wiliam. 1998. Inside the black box: Raising standards through classroom assessment. London: Kings College London, School of Education.
Blunkett, D. 2006. The Blunkett tapes: My life in the bear pit. London: Bloomsbury Publishing.
Daugherty, R. 1995. National curriculum assessment: A review of policy 1987–1994. London: Falmer Press.
Daugherty, R. 2007. Mediating academic research: the Assessment Reform Group experience. Research Papers in Education 22, no. 2: 139–53.
Dearing, R. 1994. The National Curriculum and its assessment: Final report. London: SCAA.
Department for Children, Schools and Families. 2008. The Assessment for Learning Strategy. London: DCSF.
Department for Education and Skills. 2007. Making good progress: How can we help every pupil to make good progress at school? London: DfES. http://www.dfes.gov.uk/consultations/downloadableDocs/How%20can%20we%20help%20every%20pupil%20to%20make%20good%20progress%20at%20school.pdf (accessed 31 July 2008).
Department of Education and Science. Central Advisory Council for Education (England). 1967. Children and their primary schools. Plowden report. London: HMSO.
Department of Education and Science and Welsh Office. 1988. National Curriculum: Task Group on Assessment and Testing. A report. London: DES.
Gardner, J., ed. 2006. Assessment and learning. London: Sage.
Gipps, C. 1994. Beyond testing: Towards a theory of educational assessment. London: Falmer Press.
Gipps, C., M. Brown, B. McCallum, and S. McAlister. 1995. Intuition or evidence? Teachers and national assessment of seven-year-olds. Buckingham: Open University Press.
Graham, D., and D. Tytler. 1993. A lesson for us all: The making of the National Curriculum. London: Routledge.
Great Britain. Parliament. House of Commons. Children, Schools and Families Committee. 2008. Testing and assessment volume I: Report, together with formal minutes. Third report of session 2007–08 (HC: 169-I). London: The Stationery Office.
Harlen, W. 2007. The quality of learning: assessment alternatives for primary education. Primary review research survey 3/4. Cambridge: Esmee Fairbairn Foundation. http://gtcni.openrepository.com/gtcni/bitstream/2428/29272/1/Primary_Review_Harlen_3-4_report_Quality_of_learning_-_Assessment_alternatives_071102.pdf (accessed 31 July 2008).
Holland, P.W., and D.B. Rubin, eds. 1982. Test equating. New York: Academic Press.
Lawson, N. 1992. The view from No. 11: Memoirs of a Tory radical. London: Bantam.
Marr, A. 2007. A history of modern Britain. London: Macmillan.
Massey, A., S. Green, T. Dexter, and L. Hamnett. 2003. Comparability of national tests over time: Key stage test standards between 1996 and 2001. Final report to the QCA of the comparability over time project. London: QCA.
Miliband, D. 2003. Don’t believe the NUT’s testing myths. Times Educational Supplement 4558, 14 November: 19.
Mitzel, H.C., D.M. Lewis, R.J. Patz, and D.R. Green. 2001. The bookmark procedure: Psychological perspectives. In Setting performance standards: Concepts, methods, and perspectives, ed. G.J. Cizek, 249–81. Mahwah, NJ: Erlbaum.
Newton, P. 2009. The reliability of results from National Curriculum testing in England. Educational Research 51, no. 2: 181–212.
Newton, P.E. 2007. Clarifying the purposes of educational assessment. Assessment in Education: Principles, Policy & Practice 14, no. 2: 149–70.
Newton, P.E., and C. Whetton. 2005. The effectiveness of systems for appealing against marking error. Oxford Review of Education 31, no. 2: 273–91.
Pyle, K., E. Jones, C. Williams, and J. Morrison. 2009. Investigation of the factors affecting the pre-test effect in national curriculum science assessment development in England. Educational Research 51, no. 2: 269–82.
Reed, M., and K. Lewis. 2005. Key Stage 1 evaluation of new assessment arrangements. London: QCA. http://www.qca.org.uk/libraryAssets/media/pdf_05_18931.pdf (accessed 31 July 2008).
Rose, J. 1999. Weighing the baby: The report of the independent scrutiny panel on the 1999 Key Stage 2 National Curriculum tests in English and mathematics. London: DFEE.
Ruddock, G., Ba. Tomlins, with K. Mason, B. Holding, M. Reiss, W. Keys, D. Foxman, and I. Schagen. 1993. Evaluation of national curriculum assessment in mathematics and science at Key Stage 3: The 1992 national pilot. Final report. Slough: NFER.
Sainsbury, M., ed. 1996. SATs the inside story: the development of the first national assessments for seven-year-olds, 1989–1995. Slough: NFER.
School Curriculum and Assessment Authority. 1995. Evaluation of the external marking of National Curriculum tests in 1995. London: SCAA.
Shorrocks, D., S. Daniels, L. Frobisher, N. Nelson, A. Waterson, and J. Bell. 1992. The evaluation of national curriculum assessment at Key Stage 1 (ENCA 1 Project): Final report. London: SEAC.
Shorrocks-Taylor, D. 1999. National testing: Past, present and future. Leicester: BPS Books.
Shorrocks-Taylor, D., B. Swinnerton, H. Ensaff, M. Hargreaves, M. Homer, G. Pell, P. Pool, and J. Threlfall. 2004. Evaluation of the trial assessment arrangements for Key Stage 1. Report to QCA. Leeds: University of Leeds. http://www.qca.org.uk/libraryAssets/media/8994_leeds_uni__ks1_eval_report.pdf (accessed 31 July 2008).
Statistics Commission. 2005. Measuring standards in English primary schools. Report no. 23. London: Statistics Commission.
Stobart, G. 2009. Determining validity in national curriculum assessment (Special Issue: National curriculum assessment in England: How well has it worked?). Educational Research 51, no. 2: 161–79.
Thatcher, M. 1993. The Downing Street years. London: HarperCollins.
Tymms, P. 2004. Are standards rising in English primary schools? British Educational Research Journal 30, no. 4: 477–94.
Tymms, P., and C. Merrell. 2007. Standards and quality in English primary schools over time: the national evidence. Primary Review Research Survey 4/1. Cambridge: Esmee Fairbairn Foundation. http://www.primaryreview.org.uk/Downloads/Int_Reps/2.Standards_quality_assessment/Primary_Review_Tymms_Merrell_4-1_report_Standards_Quality_071102.pdf (accessed 31 July 2008).
Whetton, C., and P. Newton. 2002. An evaluation of on-line marking. Paper presented at the 28th International Association for Educational Assessment Conference, September 3, in Hong Kong SAR, China.
Wyse, D., E. McCreery, and H. Torrance. 2008. The trajectory and impact of national reform: curriculum and assessment in English primary schools. Primary review research survey 3/2. Cambridge: Esmee Fairbairn Foundation. http://www.primaryreview.org.uk/Downloads/Int_Reps/7.Governance-finance-reform/RS_3-2_report_Curriculum_assessment_reform_080229.pdf (accessed 31 July 2008).