Educational Research
Publication details, including instructions for authors and subscription information:
http://www.tandfonline.com/loi/rere20

A brief history of a testing time: national curriculum assessment in England 1989–2008
Chris Whetton, National Foundation for Educational Research, UK
Published online: 20 May 2009.

To cite this article: Chris Whetton (2009) A brief history of a testing time: national curriculum assessment in England 1989–2008, Educational Research, 51:2, 137-159, DOI: 10.1080/00131880902891222

To link to this article: http://dx.doi.org/10.1080/00131880902891222
A brief history of a testing time: national curriculum assessment in
England 1989–2008
Chris Whetton*
National Foundation for Educational Research, UK
(Received 1 August 2008; final version received 5 December 2008)
Background: National curriculum assessment (NCA) in England has been in place for nearly 20 years. It has its origins in a political desire to regulate education, holding schools accountable. However, its form and nature also reflect educational and curriculum concerns and technical assessment issues.
Purpose: The aim of the article is to provide a narrative account of the development and changes in NCA in England from its initiation to 2008 and to explain the reasons for these.
Sources of evidence: The sources quoted are in the public domain, but in addition to academic articles, include political biographies and published official papers.
Main argument and conclusions: NCA in England has evolved over 20 years, from an attempt at a criterion-referenced system based on tasks marked by the children's own teachers through to an externally marked examination system. This change reflects the political purposes of the system for accountability, and the pressure associated with this has led to growing criticism of the effects on children and their education. Nonetheless, the results provided are widely used by the public and government, and the reasons for the survival of the system lie in both its utility and the difficulty of identifying a new system which is necessarily an improvement for all the stakeholders involved.
Keywords: assessment; testing; National Curriculum; SATs; assessment policy
Introduction
The history of national curriculum assessment (NCA) in England is a story with
several layers and several stages. The first layer is the detailed assessment system
itself – the tests, tasks and classroom assessments, together with their adminis-
trative, standard-setting and reporting arrangements. These have technical aspects
including their reliability, validity and manageability. The second layer is the wider
educational context of the curriculum (these are, after all, curriculum assessments),
schools’ practices and even the relationship between schools and society,
particularly parents. The third layer is the wider political and social milieu, which
must be seen in the context of the initial motive for the assessment system and
the reasons for its continuation. NCA was initiated by an Act of Parliament in
1988 and its structures, uses and instruments have been the subject of constant
political debate and decision. At times, this has involved personal interventions
*Email: [email protected]
Educational Research
Vol. 51, No. 2, June 2009, 137–159
ISSN 0013-1881 print/ISSN 1469-5847 online
© 2009 NFER
DOI: 10.1080/00131880902891222
http://www.informaworld.com
from the highest levels, with British prime ministers or their offices involved
and disagreements between those offices and the government’s education
department. This political layer must also be seen as incorporating the wider
social changes in Britain, and especially the initial Thatcherite1 distrust of
established professional interests and the later Blairite2 focus on the management
mechanisms of accountability and target setting as a means of effecting change in
public services.
A persistent aspect of the history is that the layers occasionally came into
collision. When this did happen, the language, values and beliefs of the three layers
differed and hence communication between them and understanding of each other
was limited.
The stages of the history have been: the initial establishment of a system; a
turbulent revolt and a simplification of that system; then a period of relative stability
but with a gradual increase in the uses and purposes of the assessment system and a
growth in the influence of the accountability purpose; and most recently, a search for
an alternative system.
There is a final introductory element to state. This is a ‘brief history’, so it is
inevitably selective. There is much more that could be written. As the author of this
‘brief history’, I have attempted to adopt the role of narrator. I should acknowledge
that I have been involved in various ways in each of the stages described, generally
as test developer or researcher. Consequently, I cannot but have my own views
and beliefs, which inevitably intrude into such a narrative. The personal involvement
means that there are some statements for which I have no written source quoted.
The sources quoted are in the public domain, but in addition to academic articles,
include political biographies and published official papers. I have not utilised media
sources (which could tell another story) except as referred to in the referenced
documents.
Although I have attempted to be a neutral narrator, a degree of bias is
inevitable and others will differ in their interpretation of events. This is the nature of
a history and if this piece provokes discussion and other views, it will have
succeeded.
The pre-cursors
The starting point for the action of NCA was the return of Margaret Thatcher for
her third term of office in 1987. The Conservative manifesto had included a
promise of educational reforms and the new government set about introducing
these. This was not, however, a sudden whim. Dissatisfaction within governments
with education had been growing since the 1970s, as set out in Michael Barber’s
(1996) The National Curriculum: a study in policy, and other political accounts (e.g.
Graham and Tytler 1993). These sources point to several important factors. The
first was a growing discomfort with the adherence to child-centred approaches in
primary schools following the Plowden report of 1967 (DfES 1967). This
culminated in the scandal of the William Tyndale School, which received great
media publicity in 1975. In that school in London, the principal and teachers of
the school took child-centeredness to extremes, defying any attempts at control by
parents or the local authority. This became a symbol of the seemingly total absence
of accountability in schools at the time. A second factor was the growing
realisation of the changing educational needs of Britain, which followed from the
economic upheavals from the 1970s onwards, towards the continuing globalisation
of the present. The economic crisis of the mid-1970s was one of the prompts for
the first public governmental statement of concerns, by the then (Labour) prime
minister, James Callaghan, in 1976 at Ruskin College.3 As he often did, he
attempted to steer a middle way, emphasising the high standards of the past but
referring to the need not just to maintain standards but to improve them. He
recognised the professionalism and vocation of teachers but lamented complaints
from industry that recruits did not have the basic tools required. He alluded to the
resources that had been injected into education and the £6 billion spent on it, and
almost invited the teaching profession to put its house in order: 'For if the public is
not convinced then the profession will be laying up trouble for itself in the future.’
This warning was not heeded.
Callaghan soon departed from power, and the country moved into 18 years of
Conservative rule, initially under Margaret Thatcher. Her first two governments
were concerned principally with areas other than education, but ministers, for
example the Chancellor of the Exchequer4 (see memoirs of Nigel Lawson, 1992),
were heavily influenced by the historical analysis of Correlli Barnett (1986), whose
The audit of war described the decline in British power during the twentieth century,
and attributed this in part to the failure of education.
This background remains of importance in current debates about NCA.
Accountability in education is now a genie that is out of the bottle and will have
to continue in some form. The belief in the link between educational standards and
economic success as a country has grown stronger. Twenty or more years after the
concerns outlined here a government minister from a different party was still
alluding to the 40 years of static standards, contrasting it with rises in the 1990s
(Miliband 2003).
In their third term in 1987, the Conservatives turned their attention to education.
Margaret Thatcher in her biography says:
The starting point for the education reforms . . . was a deep dissatisfaction (which I fully shared) with Britain's standard of education. There had been improvement in the pupil–teacher ratio and real increases in spending per child. But increases in public spending had not by and large led to higher standards. (Thatcher 1993, 590)
And so in 1988, the Education Reform Act established by law that there should be a
national curriculum and provided that, as an integral element of this, there should be
‘arrangements for assessing pupils at or near the end of each key stage for the
purpose of ascertaining what they have achieved in relation to the attainment targets
for that stage’ (Education Reform Act, 1, 2: (2)).
The term ‘key stage’ was a new one to the education system in England. There
were four key stages, which were: five to seven years old (key stage 1); eight to 11
years old (key stage 2); 12 to 14 years old (key stage 3); and 15 to 16 years old (key stage 4). Also introduced was a common labelling to describe the years of education,
from Year 1 (broadly six-year-olds) to Year 11 (broadly 16-year-olds in the last year
of compulsory education). Prior to this, there had been many organisational patterns
in different parts of England. One effect of the Act and the testing system was to
bring greater coherence to the system as a whole and to provide a clear demarcation
between the end of primary schooling (key stage 2) and secondary education in key
stages 3 and 4.
The early developments
The suddenness of the decision and the politically driven need for speed of
implementation meant that there was no existing model for putting these
arrangements into place.
The Education Reform Act specified that there should be an assessment system
with testing at or about the end of each of the four key stages. The form of this
testing, the uses of the results and all other details were left open to be decided by the
Secretary of State for Education. Several accounts of this first implementation have
been published. These give the story from the viewpoints of various participants in
the events. Among them are: the politician’s story (Baker 1993); the chief executive’s
story (Graham and Tytler 1993); the system architect’s story (Black 1997), the
Council member’s story (Daugherty 1995); the union official’s story (Barber 1996)
and finally the test developer’s story (Sainsbury 1996). In addition, there were
evaluation projects examining different aspects of the process (e.g. Shorrocks et al.
1992; Ruddock and Tomlins 1993; Gipps et al. 1995). The fullest account, which
merges the three levels of political detail, educational considerations and technical
understanding, is Shorrocks-Taylor (1999).
Overall responsibility for the tests has rested with a succession of statutory
bodies,5 initially the Schools Examination and Assessment Council (SEAC) created
by the 1988 Education Reform Act. In 1993, as part of the response to the crisis of
that year, this merged with the National Curriculum Council to become the School
Curriculum and Assessment Authority (SCAA), which itself merged with the
vocational assessment agency (the National Council for Vocational Qualifications)
to become the Qualifications and Curriculum Authority (QCA) in 1997. During
2008, it was announced that QCA itself would be broken up, with its regulatory
function passing to a new agency.
Each of these statutory bodies had its own particular approach and each was
influenced by the politics of its time as well as the views of its senior staff. This in part
accounts for the lack of coherence in the system as a whole, which will become
evident later in this account.
The blueprint
An initial blueprint for the assessment system was provided by a group known by the
acronym TGAT. This stood for Task Group on Assessment and Testing and was
composed largely of educationalists. The group was chaired by Paul Black, an
eminent science educator, and consisted of two directors of research or assessment
organisations, two head teachers, a chief education officer, an ex-chief HMI for
Primary Education, an economic researcher, an emeritus professor of electronics and
a personnel director of an engineering firm. The group immediately set about
devising a system for the curriculum and its assessment. Their task was not an easy
one. The terms of reference of the group were to advise on:
. . . the practical considerations which should govern all assessment including testing of attainment at age approximately 7, 11, 14 and 16, with a national curriculum; including the marking scale or scales and kinds of assessment including testing to be used, the need to differentiate so that the assessment can promote learning across a range of abilities, the relative role of informative and of diagnostic assessment, the uses to which the results of assessment should be put, the moderation requirements needed to secure
credibility for assessments, and the publication and other services needed to support the system – with a view to securing assessment and testing arrangements which are simple to administer, understandable by all in and outside the education service, cost-effective, and supportive of learning in schools. (DES and WO 1988)
TGAT proposed an assessment system to meet five distinct purposes. These were
that it should be:
. formative – providing information on where a pupil is, enabling teachers to
plan the next stages;
. summative – providing overall information on the achievement of pupils;
. evaluative – providing aggregated information on classes and schools to assess
curriculum issues, as well as the functioning of teachers and schools;
. informative – providing information to parents about their own children and
general information about the school;
. for professional development – giving teachers greater sophistication in
assessment, recording and monitoring so that they can evaluate their own
work.
To meet these objectives, TGAT proposed an innovative assessment system (DES
and WO 1988). Attainment was to be measured on a continuous scale of 10 levels,
covering the entire five to 16 age range. This 10-level scale was to be criterion
referenced with each level defined by a set of criteria. Pupils were to attain a level by
demonstrating the performance set out in these criteria.
The full TGAT scheme had many elements and ramifications. It was initially
hailed as a success because it appeared to meet both the government’s desire for a
system providing accountability and the teachers’ desire for a system based on
professional judgement and diagnostic assessment.
Immediately though, there were political problems. Margaret Thatcher found the
proposals:
a weighty, jargon filled document in my overnight box with a deadline for publication the following day. The fact that it was then welcomed by the Labour party, The National Union of Teachers and the Times Educational Supplement was enough to confirm for me that its approach was suspect. It proposed an elaborate and complex system of assessment – teacher dominated and uncosted. (Thatcher 1993, 594)
In a sense this was correct, as the seeds of trouble were sown by TGAT or the
manner in which it was implemented, though Paul Black continued to defend the
principles for many years (e.g. Black 1997).
The TGAT model as implemented in terms of curriculum structure was such that
for each level of each attainment target there were between one and 10 statements of
attainment that defined it and that required assessment in some way. There were also
many attainment targets for each subject. As a consequence, across a subject, there
were a large number of statements. The attainment targets in each subject and the
statements of attainment defining each level were determined by separate working
groups of experts in each subject. Thus the criteria did not represent empirical
findings about pupils’ attainments, but rather the judgements of the working groups
comprising experts and interested parties. They were aspirational targets in the sense
of what ought to be achieved rather than being defined in any research-based
empirical sense. They also differed between subjects, as each reflected its own
structures and assumptions. In addition, adding to the incoherence, there was a sort
of competition for the curriculum, as subject groups expanded their own subject to
ensure its importance as part of the whole. Science, for example, ‘stole’ earth science
from the geographers.
Key stage 1 and standard assessment tasks (SATs)
The assessment of children at the end of key stage 1, i.e. for seven-year-olds, was the
first part of the assessment regime to be put into place. This was largely an accident,
since this was the key stage that was shortest and therefore required end of key stage
assessments first. In many ways this was unfortunate, since the assessment of
children at this age was the most unknown area and also the most problematic.
Established assessments of children at this age used individual administration or
were relatively informal, but for the first time, these were having to be applied to a
compulsory universal testing regime. The tasks initially involved three subjects
(mathematics, English and science), covering a sample of nine attainment targets
leading to over 100 statements of attainment, which had to be assessed for each
child. For teachers with large classes, this multiplied up to thousands of judgements
to be made and recorded. Other subjects were to be added subsequently.
In 1989, when the National Curriculum was beginning to be introduced, three
separate development agencies were commissioned to develop their own model of
standard assessment tasks for seven-year-olds (known at the time as SATs) following
the blueprints given in the TGAT report. These SATs were to be large-scale, cross-
curricular tasks and break new ground in assessment. The three teams were:
. Consortium for Assessment and Testing in Schools (CATS), which comprised
the Institute of Education, London East Anglian Group for GCSE and
Hodder & Stoughton Publishers;
. Standard Tests and Implementation Research (STAIR), based at Manchester
University and including several local education authorities;
. NFER/BGC, which comprised NFER, Bishop Grosseteste College, Lincoln,
NFER-Nelson Publishing Company, and West Sussex and Sheffield Local
Education Authorities (LEAs).
In order to attempt to deal with the vast number of assessments that had to be made,
the three agencies adopted different strategies. CATS assessed every attainment target,
but not every statement. NFER sampled the attainment targets but attempted to assess
every statement within those selected. STAIR made a valiant attempt to assess every
statement of each attainment target, producing small mountains of paper-based tasks.
All the agencies ran trials in 1990, which demonstrated that none of the models was in
any way manageable, and so SEAC produced a new specification for the first full pilot
in 1991, asking all three agencies to re-tender against this. The NFER consortium was
successful and proceeded to the first full implementation. The new specification
required schools and teachers to assess a selection of attainment targets, some of which
were compulsory and some which could be selected from a range of options.
This early development illustrated well the tensions between: authenticity in the
assessment as representing the curriculum (and how children respond to it); the
requirements of accountability and reliability; and manageability in terms of
teachers’ time, their classroom control and children’s time (Sainsbury 1996). The
initial statutory attempt to meet these tensions was the 1991 SATs. These tasks were
designed to reflect quality work in infant classrooms and featured assessments of
investigations taking place with small groups of children. Reading was assessed
through an individual interview in which a passage from one of 27 specified books
had to be used. For the assessment of the attainment targets, each statement of
attainment was interpreted in the context of the specific activity, giving a basis for
teachers to make their judgements of children. In order to reduce the number of
assessments needing to be made, a crude form of tailored testing operated. When the
first assessment took place in primary schools across the country, there was a great
outcry about the assessment load. The work involved class teachers for too long – an
average of around 44 hours whereas the specification had been for 30 hours of
teacher time. Children’s education was disrupted for far too long. The reliability of
the SATs was criticised, since many of the tasks involved individualistic interactions
with children, and also because teachers found it difficult to assess children working
collaboratively. The SATs had attempted to meet some of the principles of sound
classroom-based educational assessment by providing a means whereby all children
could show their best, but this was unmanageable in a mass testing system.
However, there were positive outcomes, which were largely associated with the
effects on the curriculum and on teachers’ knowledge and skills in assessment. For
the first time ever, scientific and mathematical investigations were taking place in
every infant classroom and teachers were involved in making assessments of
children’s work in these areas. LEA advisers claimed that they had striven for years
to achieve this and had not succeeded. Compulsory authentic assessment had an
immediate effect. But for many, the gains in beneficial effects on the curriculum were
outweighed by the manageability and reliability arguments (Gipps et al. 1995).
The key stage 1 experience had proved very difficult, but it might have succeeded.
The primary teachers involved were coming to grips with the nature of the tasks and
were generally willing to try. The same was not the case for the next key stage to
come on stream: the testing in secondary schools of 14-year-olds. The same model of
development was followed, with a number of different agencies following differing
models, but still classroom tasks as advocated by TGAT. However, a series of crises
changed all that.
Key stage 3 and the crisis
In 1990, Kenneth Clarke became secretary of state6 for education in the final few
weeks of Margaret Thatcher’s government. He was reappointed by John Major, the
new (still Conservative) prime minister, and began to engage with the issues of the
curriculum and its assessment. Clarke liked to be regarded as a plain-speaking man,
and on being shown the pilot key stage 3 assessment materials, he publicly
denounced them as ‘elaborate nonsense’. The specifications were rapidly rewritten,
the contracts of the development agencies terminated and policy moved to the use of
‘written terminal examinations’ rather than the SATs.7 In English, the new agency,
Northern Examinations Association, lasted less than a year before it too was
replaced, this time by the agency that would undertake the work for over a decade,
the University of Cambridge Local Examination Syndicate (UCLES).
The new development agencies attempted to develop the new instruments very
quickly, but were hampered by being caught in a no-man’s-land between formal
testing and the criterion-referenced nature of the curriculum. The items within the
tests remained tied to the statements of the attainment targets and elaborate rules
were needed to aggregate the item results to the award of a level. These created
anomalies and complexities that endangered the reliability of the results awarded
(Ruddock and Tomlins 1993).
The real battleground though was the tests of English at key stage 3. This was
perhaps the first overt clash between the educational and the political layers. With
hindsight, a crisis was unavoidable, given a combination of hubristic political
involvement, the competing statutory agencies responsible (one for curriculum and
another for assessment), the lack of continuity of development agencies and the
speed of implementation. The crisis had the inevitability of a Shakespearian tragedy
and it was perhaps fitting that Shakespeare’s plays should be the centre of events.
Secondary English teachers proved to be the least pliant members of their
profession and the decisions of yet another Secretary of State, John Patten, enraged
them further. The 1993 tests were to include the curriculum subject of English for the
first time, but unlike science and mathematics, there had been no large pilot of the
English tests in 1992. English teachers did not know what to expect and began to
agitate. Patten then announced that there would be a compulsory test on one of three
specified Shakespeare plays. While the teaching and testing of Shakespeare was
objected to in principle by some, the majority of English teachers felt it could not be
fair without a reasonable time to prepare, which the late announcement did not
allow.
Pressure for a boycott of the tests grew, and the teachers’ unions responded by
balloting their members on the issue. In the National Union of Teachers ballot,
over 90% said they would support a boycott. Other teachers’ unions joined the
revolt but the inept Patten and the head of SEAC, Lord Griffiths, stood firm. Lord
Griffiths even declared that the tests were the best prepared in the history of
education, a statement manifestly untrue. Nevertheless, the government did fight
back, through a tame Local Education Authority, Wandsworth. This took the
form of legal action in the High Court against a teachers’ union, the National
Association of Schoolmasters/Union of Women Teachers (NAS/UWT), seeking to
prevent the boycott of the tests. The union, however, won the case, not on the
grounds of the underprepared tests or their educational value, but on the grounds
of its members’ workload. Teachers could not be required to undertake the extra
unpaid work of marking the tests. The government’s policy was in disarray and, to
all intents and purposes, the 1993 tests did not take place. This legal decision had
important consequences for the later structure of NCA and, unknowingly, for
future crises.
A white knight was required to rescue the system, and this took the form of Sir
Ron Dearing. The Dearing review streamlined the curriculum, decided (after some
consideration) to retain the level-based system but reduced this to eight levels, and
reduced markedly the requirements for record keeping in support of teacher
assessment, which was retained and granted a notional equal status with the test
results. The tests themselves were streamlined and changed in focus from ostensibly
criterion-referenced instruments to mark-based tests. The need for standard setting
and maintenance procedures, which became major issues later, came about as a side
effect of this. Finally, the tests were in future to be externally marked, resolving the
workload issue of the Wandsworth case (Dearing 1994). This response, forced by the
teachers, paradoxically took the system further from one of the original educational
objectives of TGAT, that of teachers marking their own students' work and
deriving useful feedback from the process.
In general terms, in the five years after 1991, the balance between validity,
manageability and reliability gradually (and at times rapidly) swung away from
validity toward manageability and reliability. This became inevitable once the
teachers’ unions, which organised a boycott of the assessments in 1993, had
determined that the paramount issue was manageability. The use of the results for
accountability purposes – the publication of school results from 1996 – also pushed
in the direction of reliability through standardisation. The advocates of curriculum
authenticity could not stand against these arguments. The trend was therefore
toward a reduction in the number of subjects tested (science was no longer included
for seven-year-olds) and in the aspects involved (any attainment targets involving
practical elements were generally excluded).
In response to widespread criticism, the curriculum had been reviewed and reduced in size and structure as part of the Dearing reforms (Dearing 1994). The attainment targets themselves were altered in structure, from statements of attainment to a more broadly based hierarchy of level descriptions, against which the teacher was to find a best fit to the child’s attainments. The number of levels was reduced from ten to eight. It was a model that based assessment on direct global judgement rather than on separate elements that then had to be aggregated. As such, it was well suited to teacher assessment, but test developers generally had to redefine elements explicitly in order to arrive at questions for the National Curriculum tests.
An interesting question to ponder is why the dangers of criterion referencing were
not recognised at the outset of the National Curriculum. Some aspects were, in that
the dangers of minimum competency tests were avoided with the progressive level-
based system. However, the problems of large numbers of criterion-referenced
statements were ignored. This did not seem to influence either TGAT or the subject
working groups. By the time test developers came on the scene, the statements of
attainment were established facts and the direction given to development agencies
was that these must be individually assessed. It should not have taken the expensive lessons of 1993 to realise the impossibility of the early curriculum and assessment model.
The Dearing reforms effectively set the form of the NCA system in place for the
next 12 years. There have been other perturbations and flurries of political interest,
but these were very minor compared with the 1993 crisis. If NCA has a hero, it has to
be Ron Dearing, and a grateful government ennobled him in 1998 for this and other
services to education.
Key stage 2 tests
In contrast to the key stage 1 and key stage 3 tests, those at key stage 2 did not go through an initial stage of open-ended tasks followed by changes of specification and style. By the time the development of the tests for 11-year-olds began in 1994, the
specification was firmly for timed written tests. These were introduced in
mathematics, science and English. After initial uncertainty, reading and writing
were separately reported but a combined English level was also calculated. In all
subjects, the suite of tests had assessment tasks for levels 1 and 2 (which became
optional from 1997), then written tests for levels 3 to 5, which covered the majority
of children. Initially there was an additional level 6 test, requested by ministers to
encourage gifted pupils, but this was dropped after 2002. Otherwise, in general, the
tests remained similar across the years. For mathematics, a test of mental arithmetic
was introduced in 1998.
In broad terms, judged as written tests of a part of the curriculum, the key stage 2
tests are good examples, which stand up to scrutiny (Rose 1999). However, there are
aspects of the curriculum that are formally omitted, such as speaking and listening in
English, investigations in mathematics and practical investigative science. As such,
they are only a partial measure of the full subject curriculum.
The making of the tests
As outlined above, political expediency brought the tests into being, but somehow their production had to be managed and delivered. This was hardly discussed by TGAT, which covered it in one paragraph, without making any recommendations.
Starting from nothing, a system had to be created to develop tests for three age
groups in several subjects, get these printed for about 1.8 million children, delivered
to about 25,000 schools, collected back again, marked reliably and then the results
collected, collated and published. Stated in these terms, it can be seen to be a
considerable logistical exercise, of the sort that has many possibilities for going
wrong. That this happened each year until 2008, generally without large disasters (although with some small ones), can be seen as a success for those involved. Whatever the
views about the quality or effects of the tests, the system has largely worked. This is
all the more remarkable when the number of organisations involved is considered.
In large part, the chosen method of managing this complex system has been to
divide it into elements and to contract agencies expert in the particular functions to
undertake these, with the statutory body (SEAC, SCAA or QCA sequentially)
overseeing the whole process. Generally, the contracts have been for a span of three
or four years but there have been instances of earlier terminations or longer
contracts. It would be both difficult and tedious to describe the whole process here, and there are few sources that describe its full complexity.
The test development process of 1999 is described in Rose (1999, Appendix 3)
and the current procedure is available on QCA’s website.8
Broadly speaking, the development process follows a classical pattern. In
simplified terms, items (questions) are written by subject matter experts, then formed
into several trial versions, which are administered to substantial numbers of students
with the responses being analysed through a variety of statistical item analysis
methods. A final version of the test (and a reserve test) is constructed from this
information and, in a second pre-test a year ahead of its actual use, the test is
administered in secure conditions to students at about the same time that they take
the current year’s test. Some students also take an unchanging anchor test. These
processes are designed to enable the new tests to be equated to the standards of
previous years. There are also a considerable number of review processes from
curriculum experts and other specialists in equality and diversity. Special
arrangements and modifications are produced for pupils with various special needs
taking the tests and these are described in Modified test administrators’ guides
available from QCA each year.
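The ‘statistical item analysis methods’ mentioned above can be illustrated with the two most basic classical statistics: an item’s facility (the proportion of pupils answering correctly) and its point-biserial discrimination (the correlation between success on the item and the total score). The Python sketch below uses invented response data and is a minimal illustration of the technique, not the development agencies’ actual procedure.

```python
# Classical item analysis on a small hypothetical trial: rows are pupils,
# columns are items, 1 = correct, 0 = incorrect; illustrative data only.

def item_analysis(responses):
    """Return, for each item, (facility, point-biserial discrimination)."""
    n_pupils = len(responses)
    totals = [sum(row) for row in responses]              # pupils' total scores
    mean_total = sum(totals) / n_pupils
    sd_total = (sum((t - mean_total) ** 2 for t in totals) / n_pupils) ** 0.5
    stats = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        n_correct = sum(item)
        p = n_correct / n_pupils                          # facility: proportion correct
        if 0 < p < 1 and sd_total > 0:
            # point-biserial: correlation between item score and total score
            mean_correct = sum(t for t, x in zip(totals, item) if x) / n_correct
            r_pb = ((mean_correct - mean_total) / sd_total) * (p / (1 - p)) ** 0.5
        else:
            r_pb = 0.0                                    # undefined for uniform items
        stats.append((p, r_pb))
    return stats
```

Items with very low or very high facility, or with weak discrimination, would be revised or dropped before the final test form is assembled.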
The agencies involved in test development have been university departments,
examination boards, research institutes and, for a period, QCA itself. There has been
a narrowing of suppliers over the years, with academic institutes largely withdrawing
and only specialist assessment agencies remaining. This can probably be seen as a
response to the increasing stringency of the contracts, which came to contain many
penalties and other legal requirements, and to the difficulty of aspects of the
development process.
In an early sign of this narrowing of suppliers, in 1998, when the contracts for the
2000–03 tests were being let, QCA decided to take the mathematics development
work in house. It recruited and created an internal team to undertake the
development of the mathematics tests for all three key stages. However, the
quinquennial review of QCA in 2002 pointed to the anomaly of the same
organisation (and sometimes same individuals) overseeing the tests, developing the
tests and being responsible for regulating them. The review recommended that QCA
should consider the case for removing this responsibility for delivering the tests from
its remit and advise ministers on this. As a response, in 2004, the QCA mathematics test development team was outsourced to Pearson Education, a commercial company, as a going concern under the governmental Transfer of Undertakings (Protection of Employment) (TUPE) arrangements.
In 2004, in a further response, QCA also created an internal department, the
National Assessment Agency (NAA), which oversees the development and delivery
process for the tests, separate from its regulatory department. Indeed, in 2008, the
separation process is being taken further as the regulatory function is transferred to a
totally new agency, Ofqual (Office of the Qualifications and Examinations
Regulator).
From 1995, for key stages 2 and 3, the tests were externally marked as was
suggested in the 1993 curriculum review and forced by the legal action of NAS/
UWT. As with test development, this has been undertaken by external agencies. The
resources required for this exercise have meant that it has been conducted by large examination authorities: initially AQA, then Edexcel and, most recently, ETS, a large US organisation which (through its European subsidiary, ETS Europe) was appointed from
2008 to 2012. The first year of this contract produced a crisis of delivery, which led to
many schools’ tests not being marked by the due date. This resulted in the
termination of the ETS contract and a major embarrassment for the government,
leading in part to a reconsideration of the nature of NCA, which is described in the
final section.
The marking process has, perhaps surprisingly, remained fairly constant over the
period. Marking is undertaken over a period of a few weeks by current or past
teachers, generally in their own homes. Markers receive training and there are checks
on their accuracy through re-marking of samples of their work. A complex system of
chief markers, their deputies and team leaders has operated in a slightly changing fashion, but with the same underlying principle: that a ‘true standard’ lies in the head of the chief marker, and that others must try to match it.
An evaluation of the first year of marking (SCAA 1995) showed it to be well conducted and good value for money, but doubts have gradually grown as the importance of the results to schools has escalated. This has been seen most clearly in the
numbers of requests for remarking or review, which have steadily increased.
The National Curriculum testing system was an early adopter of openness in that
the marked papers are returned to schools for scrutiny. Newton and Whetton (2005)
describe the paradoxes and conflicts that arise from this. In principle, an appeals
system such as this is designed to add confidence to the system since it allows
checking and the correction of errors. However, the disclosure of those errors instead
leads to a reduction of confidence in the system, rather than a recognition of the
importance of the safety net.
It is perhaps surprising that, as yet, the use of modern technology has not become
more prevalent in the marking process. On-screen marking of the tests, either in
centres or in a distributed manner by markers working at home, is technically
feasible and no longer leading-edge. In relation to this, however, the government and its agencies have been risk-averse. This is apparently due to the
importance they attach to the security of publishing the results on time and
memories of two previous embarrassments, the failed attempt in 1998 to gather data
from scoring sheets and the late delivery of key stage 3 English test results to schools
in 2004. The latter occurred after a change in the specification involving a switch to
component marking, such that reading and writing were marked separately and then
the results recombined. QCA’s Board undertook its own investigation into this and
produced a damning report (Beasley et al. 2004) stating that poor management
decisions and operational inefficiencies led to delays in getting papers or results to
schools.
There have, however, been investigations and pilots of the introduction of marking technology for national curriculum tests, but evaluation of these did not demonstrate that the systems were sufficiently robust to proceed to a full-scale implementation (Whetton and Newton 2002).
At the time of writing, the delivery failure of marking in 2008 is still being
investigated but its reasons and consequences will continue to have reverberations in
future years.
Labour’s changes
In 1997, after 17 years of Conservative government, there was a landslide victory for
the Labour party. Famously, in one of his first speeches, the new prime minister,
Tony Blair, declared that he had three priorities: ‘education, education, education’.
This was part of an ambition to make Britain a better place. The Labour manifesto
of 1997 said ‘I believe Britain can and must be better: better schools, better hospitals,
better ways of tackling crime, of building a modern welfare state, of equipping
ourselves for a new world economy’ (quoted in Marr 2007, 512).
The problem became one of translating these general statements into a reality.
Certainly in education, the new government enthusiastically set about action. This
consisted of a stream of initiatives covering all aspects of school life. There was early success with the literacy and numeracy strategies, which were perceived as effective because of increases in the proportions of children achieving literacy targets (though these increases were later questioned; see below). The measure of the
success of these initiatives was often the outcomes of the national curriculum tests.
For schools, this meant that the publication of their own results came to have
greater and greater significance, but ministers also embraced the target culture,
setting targets for the country as a whole in terms of the proportion of students
attaining the expected level for their age in the national curriculum tests. Associated
with this, there was a subtle but important change in the meaning of the levels. TGAT and the early groups had defined the benchmark level 4 as being the
expected attainment of the average 11-year-old, but the target setting culture
amplified this to be the level all students were expected to achieve.
This heightening of the stakes of the national curriculum tests was illustrated by
events in 1999. A newspaper article in the Daily Telegraph, a paper opposed to New
Labour, alleged that the cut-scores for the national curriculum tests in reading were
being deliberately lowered in order that the government’s targets could be met. This
was taken up by other parts of the media, so in a classical political damage limitation
manoeuvre, the secretary of state for education, David Blunkett, set up an independent
inquiry, headed by Jim Rose, deputy chief inspector at Ofsted, the schools inspectorate.
In his (audio-taped) memoirs, Blunkett (2006, 125) describes this as:
a silly story about whether the Key Stage 2 results on literacy and numeracy have been affected by the chopping and changing of the examiners, who had originally set easy questions, realised the mistake in time and withdrew all the questions they considered too easy – then ended up with tests that were so difficult compared with previous years that there are accusations flying around from all angles.
This quotation illustrates the lack of understanding of assessment processes involved
in test development and standard setting, which is symptomatic of such debate, but
also the political importance of these processes.
In fact, the procedures for maintaining the cut-scores at the same level of ability and knowledge for tests which change each year and are made public have been a constant source of difficulty for QCA and the developers involved. TGAT hardly
addressed this issue at all, assuming that criterion-referencing would be the solution:
‘Only by criterion-referencing can standards be monitored’ (DES and WO 1988,
para. 222). But from 1996 onwards, when the tests became more usual mark-based
instruments, the maintenance of a constant standard became an issue, which then
increased in importance as the stakes of the tests results rose politically and they
became measures of governmental success.
The methods used have evolved over the years and have also differed between the
subjects and key stages. In part, this was because of the different development
agencies involved, and in part, the differing nature of the subjects and their
components. Writing, for example, lends itself less well to item-based equating
methods. The types of processes that have been adopted fall into two groups – the
first are judgemental processes and the second statistical equating exercises.
In the early years, greater credence was placed on judgemental methods, and
variants of Angoff procedures (Angoff 1971) were often used. In these, teachers or
other curriculum experts scrutinised the test items closely and made a judgement
about the proportion of students at a level threshold who would answer correctly.
Later this was sometimes replaced by bookmarking processes (Mitzel et al. 2001).
These judgemental methods based on items alone have gradually declined in use, as they have proved less useful and less consistent with other information. However,
the alternative judgemental method of script scrutiny continues to be an important
part of the process. In this, a selection of completed scripts from the tests’ actual use is scrutinised individually by a group of experienced markers and compared to the
scripts from previous years. This leads to a combined judgement of where the cut-
scores for each level should be.
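In outline, a basic Angoff exercise reduces to simple arithmetic. The Python sketch below (with invented panel figures; it illustrates the unmodified procedure, not any agency’s actual implementation) sums each judge’s item-level probability estimates to obtain that judge’s implied raw cut-score, then averages across the panel.

```python
# Basic Angoff calculation: each judge estimates, for every item, the
# probability that a pupil on the borderline of a level answers correctly.
# Summing one judge's estimates gives that judge's implied raw cut-score;
# the panel's recommendation is the mean across judges.
# All figures below are invented for illustration.

def angoff_cut_score(judgements):
    """judgements: one list of per-item probabilities per judge."""
    per_judge = [sum(probs) for probs in judgements]
    return sum(per_judge) / len(per_judge)

panel = [
    [0.9, 0.7, 0.4, 0.6],  # judge 1's estimates for four items
    [0.8, 0.6, 0.5, 0.5],  # judge 2
    [0.7, 0.8, 0.3, 0.6],  # judge 3
]
cut = angoff_cut_score(panel)  # panel's recommended cut-score in raw marks
```

In practice the panel would discuss discrepant estimates and often revise them over several rounds before the final average is taken.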
Some form of statistical equating has been used consistently each year for the
national curriculum tests. The development process for the tests is such that each test
is available a year in advance of its use. It is then given (in a secure fashion) to
students who are about to take the current year’s test. The two sets of results can
then be equated, either using well-established whole-test methods (Angoff 1971) or
through item response theory methods (Holland and Rubin 1982). An issue in these
processes has been what has come to be known as the pre-test effect. It became
apparent that children’s motivation and knowledge were less when taking the low-
stakes equating test than when doing the actual test (Pyle et al. 2009, this issue). This
effect had then to be allowed for judgementally in the decision process. In response
to this difficulty, a variant on equating has been the use of an anchor test, i.e. a test in
the subject area that is used over a number of years and equated each year with the
new test. This has the advantage of checking accumulated drift from the year-on-
year equating and of avoiding the pre-test effect, since both tests are taken under
low-stakes conditions.
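The equating step can be illustrated with the simplest such technique, linear (mean-sigma) equating. The sketch below uses invented score distributions and glosses over the anchor-test machinery that matches the two pupil groups; it is a deliberately simplified stand-in for the whole-test and item response theory methods cited above.

```python
# Linear (mean-sigma) equating between two test forms, with invented data.
# The anchor-test matching of the two groups is assumed, not implemented.

def mean_sd(xs):
    m = sum(xs) / len(xs)
    return m, (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

def linear_equate(new_scores, old_scores):
    """Return a function mapping a mark on the new test onto the old test's
    scale, so that the two score distributions share mean and spread."""
    m_new, s_new = mean_sd(new_scores)
    m_old, s_old = mean_sd(old_scores)
    return lambda x: m_old + (s_old / s_new) * (x - m_new)

# Hypothetical pre-test results for comparable groups on the two forms:
old = [40, 45, 50, 55, 60]
new = [35, 40, 45, 50, 55]        # the new form looks about 5 marks harder
to_old = linear_equate(new, old)
# to_old(45) == 50: last year's cut-score of 50 maps to 45 on the new test
```

With this mapping, a cut-score fixed on last year’s scale can be carried onto the new form; the pre-test effect described above would then have to be allowed for judgementally.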
Ultimately, the Rose Inquiry reported that there had been no ‘fiddling’ of the
results to present improved literacy and numeracy outcomes. This finding was accepted, since Blunkett was a better politician than psychometrician and had included opposition appointees, a leading head teacher and the Times newspaper’s education correspondent on the inquiry panel.
However, the Rose Inquiry was largely into whether proper procedures had been
followed in setting and marking the 1999 tests. There was a more fundamental
question raised by the political importance of the targets as measures of change over
time. This was whether the use of these procedures over an extended period had
maintained the required standard at a constant level. Following the Rose report, this
question was addressed by a large-scale study undertaken by Massey et al. (2003)
and funded by QCA. This research was conducted in Northern Ireland, where Massey et al. administered the tests used in England in 1996 and in a later year (1999, 2000 or
2001) to the same students. The tests were key stage 1 reading and mathematics; key
stage 2 English, mathematics and science; key stage 3 English, mathematics and
science. Massey et al. acknowledge the methodological difficulties of their approach
but believe that they were as successful as possible in obtaining conditions fair to
both versions being compared, and conclude that overall between 1996 and 2000,
standards in the national testing system were maintained or became marginally
stricter more often than they could be challenged. Nevertheless, for one high-profile
component, key stage 2 English, the experimental evidence indicated that a
significant proportion of the apparent improvement in national results may have
arisen from variation in test standards.
Such a conclusion would be supported by Tymms (2004), who challenged the
validity of the rise in proportion of children achieving the target level 4 in the key
stage 2 tests. This was a response to the claims of the government advisor who had
been responsible for many of the educational initiatives of the new Labour government, first as chief adviser to the secretary of state for education on school standards and then in the policy office of the prime minister (Barber 2001).
Tymms considered the general pattern of changes, comparing the statutory test data with the results from other studies, and found clear rises in standards up to 2000, though not as strong as the official data suggested. He agreed that the small changes since 2000 were consistent across sources. Tymms went further and raised this issue with
the Statistics Commission. This was an independent public body, set up in June
2000 to ‘help ensure that official statistics are trustworthy and responsive to public
needs’. (It has since been subsumed into the UK Statistics Authority.) Tymms’
concern was that the statistics used – key stage 2 test scores – were not suitable for
the purpose of monitoring trends in standards over a period of years. The
Statistics Commission (2005) concluded ‘that it has been established that
(a) the improvement in key stage 2 test scores between 1995 and 2000 substantially
overstates the improvement in standards in English primary schools over that
period, but (b) there was nevertheless some rise in standards.’ However, it also
concluded that ‘we are aware of no particular fault with the procedures that QCA
now follow for maintaining test standards over time’ and further that ‘test scores
may not be an ideal measure of standards over time, but it does not follow that
they are a completely unsuitable measure for a [government] target. There is no
real alternative at present to using statutory test scores for setting targets for
aggregate standards.’
A first lessening in the testing regime occurred in 2005. As public pressure
mounted on ministers over the effects of testing, one strand of complaint was
concerned with the effects on children. It was apparent that formal testing of children at the ages of 11 and 14 was acceptable to the public (if not to the profession). However, there were greater public concerns over the pressures on young children arising from being subjected to formal tests. In response, in 2004,
QCA organised a trial of a new approach to the key stage 1 assessment in which
more emphasis was placed on teacher assessment and the final judgement was that of
the teacher. Schools had, though, first to use nationally provided tests or tasks, but at
a time of their own choice. The trial was evaluated by Leeds University (Shorrocks-
Taylor et al. 2004) and judged successful, so that from 2005, the new system was
adopted for all schools.
This was the first time since the early days of the testing regime that the testing
arrangements had been lightened, and the first return to teachers’ assessment taking
precedence over tests.
It was perhaps significant that this was for the key stage 1 arrangements, where
the tests served less of an accountability function, but this could nevertheless be seen
as something of a possible model for the future of NCA.
However, less optimistically, an NFER evaluation of the arrangements a year
later struck a note of caution. Only a minority of teachers were taking advantage of
the flexibility of the new arrangements; the majority were using the same tasks/tests
as previously and continuing to administer them during the summer term. There
were indications, however, that teachers intended to be more flexible, in choice of
tasks/tests and timing, in future (Reed and Lewis 2005).
Apart from this, from 2000 through to 2008, the NCA system was relatively
unchanging. However, this is not to say it was without criticism. These criticisms fall
into several groups:
– that the system has developed too many purposes, so that it cannot adequately serve them all (Newton 2007 identifies 18 ways in which the test results are being used);
– that the accountability function puts too much pressure on schools;
– that the accountability function and the nature of the tests lead to a narrowing of the curriculum;
– that the tests put too much pressure on children;
– that standards would be raised to a greater degree (and more widely and validly) through an emphasis on ‘Assessment for Learning’.
During 2007 and 2008, these issues were examined in two different arenas: the
academic and the political. Under Robin Alexander, the Primary Review based at
Cambridge University examined the condition and future of primary education in
England. The Primary Review commissioned academic surveys on all aspects of
primary education, including the assessment system. The paper dealing with this
concluded:
The current system of assessment in England provides information of only low dependability whilst having some negative impacts on teaching and learning. Alternative systems need to be considered. One in which summative assessment is based on teachers’ judgements would provide information that is more valid than tests and at least as reliable, but it would be necessary to avoid high stakes being attached to the results by not using them for purposes other than reporting on individual pupils. For national monitoring, a regular sample survey, using a large bank of items, would give far more information than is provided by results of individual pupils who have all taken the same test. (Harlen 2007)
Other reviews by Wyse, McCreery and Torrance (2008) on the impact of national
reforms in curriculum and assessment on English primary schools, and Tymms and
Merrell (2007) on standards and quality in English primary schools over time offered
similar conclusions, from their own perspectives.
These surveys provided a powerful academic case against the current form of
national assessment system, but their critical tone has been perceived by government
as unhelpful, rather than constructive, which may have limited their immediate
influence, although their longer-term effects remain to be seen.
The other major academic input has been the promotion of Assessment for
Learning, which has had rather more success. The driver of this success was a small
number of academics acting together as a modern single-issue pressure group under
the banner of the Assessment Reform Group (Daugherty 2007). The starting point
was from the work of Paul Black and Dylan Wiliam (1998), which demonstrated that
improving formative assessment raises standards. Subsequent work was devoted to
clarifying the definition of Assessment for Learning and in proselytising on its
behalf. The outputs of the group have ranged from summary leaflets for teachers to a
full exposition of the practice and policy of using assessment to help and report
learning (Gardner 2006). The activities of the group have been extremely influential
and the phrase ‘Assessment for Learning’ has been widely adopted including by
government and its agencies. This culminated in June 2008, when the government
announced spending of £150 million on a programme calling itself ‘The Assessment
for Learning Strategy’ (DCSF 2008). However, as is often the case with a widespread
adoption of carefully nuanced principles, the mass messages may not have been what
the originators wished for.
The second, political, examination of NCA was that of the Children, Schools
and Families Committee of the House of Commons, which undertook an enquiry
into Testing and Assessment (GB. Parliament. HoC. CSF Committee 2008). Both
the standards strand and the Assessment for Learning strand figure strongly in that
report. In slight contrast to the academics’ general hostility to national testing, the committee concluded that:
We consider that the weight of evidence in favour of the need for a system of national testing is persuasive and we are content that the principle of national testing is sound. Appropriate testing can help to ensure that teachers focus on achievement and often that has meant excellent teaching, which is very welcome. (Para. 25)
However, echoing other views, the report went on to criticise the current system and
urge the government to find an alternative:
National test results are now used for a wide variety of purposes across many different levels – national, local, institutional and individual. Each of these purposes may be legitimate in its own right, but the question we have asked is whether the current national testing system is a valid means by which to achieve these purposes. We conclude that, in some cases, it is not. In particular, we find that the use of national test results for the purpose of school accountability has resulted in some schools emphasising the maximisation of test results at the expense of a more rounded education for their pupils. A variety of classroom practices aimed at improving test results has distorted the education of some children . . . We find that ‘teaching to the test’ and narrowing of the taught curriculum are widespread phenomena in schools, resulting in a disproportionate focus on the ‘core’ subjects of English, mathematics and science and, in particular, on those aspects of these subjects which are likely to be tested in an examination . . . We conclude that the national testing system should be reformed to decouple these multiple purposes in such a way as to remove from schools the imperative to pursue test results at all costs. (Summary)
Even earlier than the parliamentary committee report, from 2007, there seems to
have been a growing acceptance that the NCA system should change. For the
government, this had two motivations: to attempt to reduce the criticism of the
testing system from professionals and the media, and to reinvigorate attempts to
improve standards that they saw as having stalled, as indicated by the small increases
in proportions reaching the accountability targets from 2000 onwards.
The proposed solution was announced by Alan Johnson, the secretary of state for
education, in January 2007. This was known as the Making Good Progress pilot and
within this, an assessment system based on teacher assessment and ‘single level tests’. As ever, this was a political decision taken in haste, without consultation or any detailed planning. Rather, the broad principles were announced and the problem handed over to officials and statutory bodies, who were charged with providing a workable system.
Making Good Progress (DfES 2007) stated that the government was interested in
‘exploring the impact of enabling teachers to enter a pupil for an externally marked
test as soon as they are confident (through their own systematic assessments) that the
pupil has progressed to the next level.’ The proposal went on to say that the tests
would be offered to schools at two points in the year and pupils could be entered
individually for a test that marks success at one level, and stimulates progress
towards the next level. They were to be no more burdensome than the current end-of-key-stage ‘multi-level’ tests and would:
generate the data on achievement that is so important for school accountability. The system would be a one-way ratchet: once a pupil has passed a level, they will never go back, only forward. The model could be a powerful driver for progression, raising expectations for all pupils, motivating them, bringing a sharp focus on ‘next steps’ and perhaps especially benefiting those who start the key stage with lower attainment than their peers, or are currently making too little progress. Ultimately, these tests might replace end of key stage arrangements.
Since that announcement, the pilot has begun, the tests have begun to be developed
and, as for the SATs in the early days of the national curriculum, the model is being
refined in an attempt to make a broad conception into a working system. It remains
to be seen whether this can be achieved.
It is important, though, to recognise what was not being given up here. Despite
the criticism from many of those within education of the distorting effects of
accountability, this remains a part of the proposals for the future. Similarly, the use
of short written tests would retain the possibility of a narrowing of the curriculum in
these subjects to the aspects being tested. In broad terms, what was being proposed
was not a change of strategy but a change of tactics.
This change of tactics took a further sudden step in October 2008, when (yet
another) secretary of state for schools, Ed Balls, unexpectedly announced the
cessation of National Curriculum testing at key stage 3, with immediate effect. This
included both the existing key stage tests and the single level tests, which would
continue to be piloted for key stage 2 only.
In making this announcement, the Secretary of State set out the government’s
view of three key principles for the continuation of the assessment system. He said it
should:
– First, give parents the information they need to compare different schools,
choose the right school for their child and then track their child’s progress;
– Second, enable head teachers and teachers to secure the progress of every child
and their school as a whole, without unnecessary burdens or bureaucracy;
– Third, allow the public to hold national and local government and governing
bodies to account for the performance of schools.
In the light of these principles, he stated that he had concluded that the key stage 3
testing was not justified. Parents could obtain information from GCSE results and a
new system of ‘real-time reporting of progress’ would be developed for tracking
individual pupils. In addition, there would also be an externally marked test with a
sample of pupils to measure national performance, holding the government to
account. The details of the new elements are yet to be determined, and an expert
group was set up to advise on them.
It is always difficult to predict the future, but it does seem that NCA in England is
about to change further, beyond even this announcement. It is not easy, though, to
predict what this change will be. The parliamentary committee accepted that there
should be some system of national testing, but wished it to be reformed. At present,
the government retains its wish to have an accountability measure at key stage 2, both
because of long memories of the lack of control in primary education before the
Education Reform Act and because target setting currently remains the manner in
which central government manages large state enterprises. It may be that an
alternative philosophy of government emerges, but this is not apparent at present.
Indeed, the key stage 3 announcement reinforced the belief in the current
philosophy.
It is a reasonable question to ask whether NCA has been successful, but like many
simple questions, there is no single straightforward answer. A first clarification needs
to be: successful in terms of what criteria? NCA can be considered from a political
viewpoint, an educational viewpoint, an assessment viewpoint and even as an
operational process.
From a political viewpoint, overall the system can generally be seen as a success,
with some caveats. The original reasons for central government intervention in
education in the 1980s were to bring a measure of accountability to schools, to
control the curriculum and to improve standards of achievement. It is now the case
that schools acknowledge their responsibilities to their students, to parents and to
the community at large to a far greater extent than before. In general, they also
accept the need for some accountability measures, although perhaps not the current
ones. This has not come about only through NCA and the publication of results, but
also through an inspection system, originally frequent and stringent, and measures to
step in and manage or close failing schools. The testing system has contributed to
curriculum implementation and control through the concentration on English,
mathematics and science, ensuring their emphasis in schools. It has also helped to
give priority to particular subject elements from time to time. It is seen as having
raised standards (as elaborated in the government evidence to the Select Committee),
though the extent of this is disputed by others (Statistics Commission 2005). More
negatively from a political viewpoint, there was the correction to the process in the
1990s following the first introduction, but the political interpretation is that this was
a necessary sorting-out of the over-complexity created by the educational
professionals. Finally, a measure of success is that until 2008, there was no
widespread public demand for an end to the system, the clamour mostly arising from
educational professionals, who could be considered to be producer interests. The
failure of delivery of the marking agency in 2008 increased criticism of the system
and led to other aspects, such as marker reliability, being challenged. Hence, the
period of political success is coming to an end, and a new phase is required to
maintain the continuing objective of further improvement, without losing the other
aims of control and accountability.
From an educational viewpoint, the verdict is rather different – see the papers of the
Cambridge Review: Harlen (2007), Wyse, McCreery and Torrance (2008), and Tymms
and Merrell (2007). The pressure placed on schools through the accountability
mechanisms is considered counter-productive and the curriculum has been narrowed
both in terms of subjects studied and the nature of the learning process. The teaching
to the tests may consist of coaching in test-taking skills and emphasising only the
content of the tests, rather than the subject more broadly. This is regarded as
fostering ‘shallow learning’ rather than ‘deep learning’ (Gipps 1994, 23–4). It is also
believed that the momentum of raising standards would have been greater through
investment in formative or classroom assessment rather than a testing regime. In
summary then, the view of educational academics and teachers’ representatives is that
NCA has not been an educational success, and other lighter-touch mechanisms
would have been and are preferable.
Judged from an assessment viewpoint, the emphasis must be on validity with
reliability encompassed within this. Stobart (2009, this issue) and Newton (2009,
this issue) give their verdicts on these. The problem with arriving at an overall view
is that validity should arise from a clear statement of purpose, and this has been
difficult to determine for NCA. The original broad purposes of TGAT have been
supplemented by many others over the years. Newton (2007) identifies 18 different
purposes of the NCA system and the evidence to the Select Committee located
several more. Hence, the system as a whole meets these purposes to varying
extents, but because of their number achieves none perfectly. The key stage 3
announcement in October 2008 was a clear statement of the government’s priorities
among the many purposes. From the narrow viewpoint of the tests themselves,
there were 15 years in which a new style of high-quality curriculum-related testing
was established and many problems addressed, bringing test development in the
UK up to date.
Finally, there is the organisational element. Certainly for QCA a successful year
has been considered to be one in which the tests are delivered to schools, collected
back, marked and the results collated and returned, to time, without loss or
complaint. In general this has happened – the complaints have not been
unmanageable and the losses relatively few. The events of 2008, with a failure of
delivery of the marking contract, though recent, are the exception. From time to time,
there have been economic evaluations of the system and in general terms these tend
to show that the operations represent value for money. The costs are around £40
million per year or just over £2 per child tested. The average cost per school is
around £4000 per secondary school and £1000 per primary school. This could be
seen as good value for the effects on the system (if these are considered beneficial)
and can bear comparison with the costs of promoting teacher assessment (£150
million over three years announced in 2008) or moderation systems. Certainly, it
would be difficult to have a similar effect throughout a large system such as that in
England through other means for the same expenditure.
An overall evaluation of NCA has therefore to be mixed, since it has not
satisfied all stakeholders and any view of it depends on the viewpoint. It is a
system that has not been so bad that it has failed or been completely abolished, but
neither has it been so good that it is widely acclaimed. As such it is like many of
the modern public institutions of the UK (or more narrowly in this case, England),
which are of such a size, such an organisation and such complexity that they are
difficult to change and even more difficult to improve in a manner that gives all
stakeholders what they want. It has come to have a momentum that gives it
staying power, and even though there is at present a desire for change, it cannot be
certain that any new system will be necessarily better and without its own
unforeseen consequences. The next few years will determine the nature of the
change, and the years beyond that, whether this will come to be regarded as an
improvement. But that is another, future history.
Acknowledgements
I am grateful for the assistance and comments of Marian Sainsbury, Jo-Anne Baird, Frances Brill and Felicity Fletcher-Campbell, who all provided helpful comments to improve and clarify the arguments of this article.
Notes
1. ‘Thatcherite’ is a term used to encapsulate the policies and beliefs of Margaret Thatcher, prime minister of the UK from 1979 to 1990. ‘Thatcherism’ is characterised by decreased state intervention, monetarist economic policy, privatisation of state-owned industries, lower taxation and opposition to trade unions. Paradoxically, the intervention in education increased state control. Thatcher was leader of the Conservative political party, traditionally the party of the right.
2. ‘Blairite’ is a term used to describe the beliefs or followers of Tony Blair, British prime minister 1997 to 2007. Blair’s policies are not easy to encapsulate but did include the promotion of quasi-markets into public services, including education. Blair was leader of the (New) Labour political party, traditionally the party of the left, though increasingly centrist under Blair.
3. Ruskin College is an independent educational institution in Oxford, founded to provide opportunities for working-class men and women, and with strong links to the Trade Union movement. Its choice for this speech was therefore a carefully considered political statement.
4. The title of the chief finance minister of the UK government.
5. A feature of the governmental organisation of the UK is the use of statutory bodies to implement change or regulate markets. Such bodies are nominally independent and outside government departments, but are nevertheless heavily controlled by them.
6. ‘Secretary of State’ is the title used in the UK for ministers of large government departments. Often, Secretaries of State have several junior ministers working under them. In the period described in this article, the Secretary of State responsible for schools had a variety of other responsibilities in addition, with these changing frequently.
7. From this point on, the term SAT has never been used in an official publication. Nevertheless, it has proved to be the most long-lasting element of TGAT, and remains in common (mis-used) parlance in public and press debates.
8. Page heading: Test development, level setting and maintaining standards. www.qca.org.uk/qca.42021.aspx (accessed 30 July 2008). However, QCA’s website constantly changes its URLs.
References
Angoff, W.H. 1971. Scales, norms and equivalent scores. In Educational measurement, ed. R.L. Thorndike, 508–600. Washington, DC: American Council on Education.
Baker, K. 1993. The turbulent years: My life in politics. London: Faber & Faber.
Barber, M., ed. 1996. The National Curriculum: A study in policy. Keele: Keele University Press.
Barber, M. 2001. The very big picture. School Effectiveness and School Improvement 12, no. 2: 213–28.
Barnett, C. 1986. The audit of war: The illusion and reality of Britain as a great nation. London: Macmillan.
Beasley, M., E. Gould, S. Kirkham, and H. Emery. 2004. Report on Key Stage 3 English review of service delivery failure 2003–2004 to QCA Board. London: QCA. http://www.qca.org.uk/libraryAssets/media/10343_ks3_en_report_04.pdf (accessed 31 July 2008).
Black, P. 1997. Whatever happened to TGAT? In Assessment vs evaluation, ed. Cedric Cullingford, 24–50. London: Cassell.
Black, P., and D. Wiliam. 1998. Inside the black box: Raising standards through classroom assessment. London: Kings College London, School of Education.
Blunkett, D. 2006. The Blunkett tapes: My life in the bear pit. London: Bloomsbury Publishing.
Daugherty, R. 1995. National curriculum assessment: A review of policy 1987–1994. London: Falmer Press.
Daugherty, R. 2007. Mediating academic research: the Assessment Reform Group experience. Research Papers in Education 22, no. 2: 139–53.
Dearing, R. 1994. The National Curriculum and its assessment: Final report. London: SCAA.
Department for Children, Schools and Families. 2008. The Assessment for Learning Strategy. London: DCSF.
Department for Education and Skills. 2007. Making good progress: How can we help every pupil to make good progress at school? London: DfES. http://www.dfes.gov.uk/consultations/downloadableDocs/How%20can%20we%20help%20every%20pupil%20to%20make%20good%20progress%20at%20school.pdf (accessed 31 July 2008).
Department of Education and Science. Central Advisory Council for Education (England). 1967. Children and their primary schools. Plowden report. London: HMSO.
Department of Education and Science and Welsh Office. 1988. National Curriculum: Task Group on Assessment and Testing. A report. London: DES.
Gardner, J., ed. 2006. Assessment and learning. London: Sage.
Gipps, C. 1994. Beyond testing: Towards a theory of educational assessment. London: Falmer Press.
Gipps, C., M. Brown, B. McCallum, and S. McAlister. 1995. Intuition or evidence? Teachers and national assessment of seven-year-olds. Buckingham: Open University Press.
Graham, D., and D. Tytler. 1993. A lesson for us all: The making of the National Curriculum. London: Routledge.
Great Britain. Parliament. House of Commons. Children, Schools and Families Committee. 2008. Testing and assessment volume I: Report, together with formal minutes. Third report of session 2007–08 (HC: 169-I). London: The Stationery Office.
Harlen, W. 2007. The quality of learning: assessment alternatives for primary education. Primary review research survey 3/4. Cambridge: Esmee Fairbairn Foundation. http://gtcni.openrepository.com/gtcni/bitstream/2428/29272/1/Primary_Review_Harlen_3-4_report_Quality_of_learning_-_Assessment_alternatives_071102.pdf (accessed 31 July 2008).
Holland, P.W., and D.B. Rubin, eds. 1982. Test equating. New York: Academic Press.
Lawson, N. 1992. The view from No. 11: Memoirs of a Tory radical. London: Bantam.
Marr, A. 2007. A history of modern Britain. London: Macmillan.
Massey, A., S. Green, T. Dexter, and L. Hamnett. 2003. Comparability of national tests over time: Key stage test standards between 1996 and 2001. Final report to the QCA of the comparability over time project. London: QCA.
Miliband, D. 2003. Don’t believe the NUT’s testing myths. Times Educational Supplement 4558, 14 November: 19.
Mitzel, H.C., D.M. Lewis, R.J. Patz, and D.R. Green. 2001. The bookmark procedure: Psychological perspectives. In Setting performance standards: Concepts, methods, and perspectives, ed. G.J. Cizek, 249–81. Mahwah, NJ: Erlbaum.
Newton, P. 2009. The reliability of results from National Curriculum testing in England. Educational Research 51, no. 2: 181–212.
Newton, P.E. 2007. Clarifying the purposes of educational assessment. Assessment in Education: Principles, Policy & Practice 14, no. 2: 149–70.
Newton, P.E., and C. Whetton. 2005. The effectiveness of systems for appealing against marking error. Oxford Review of Education 31, no. 2: 273–91.
Pyle, K., E. Jones, C. Williams, and J. Morrison. 2009. Investigation of the factors affecting the pre-test effect in national curriculum science assessment development in England. Educational Research 51, no. 2: 269–82.
Reed, M., and K. Lewis. 2005. Key Stage 1 evaluation of new assessment arrangements. London: QCA. http://www.qca.org.uk/libraryAssets/media/pdf_05_18931.pdf (accessed 31 July 2008).
Rose, J. 1999. Weighing the baby: The report of the independent scrutiny panel on the 1999 Key Stage 2 National Curriculum tests in English and mathematics. London: DFEE.
Ruddock, G., Ba. Tomlins, with K. Mason, B. Holding, M. Reiss, W. Keys, D. Foxman, and I. Schagen. 1993. Evaluation of national curriculum assessment in mathematics and science at Key Stage 3: The 1992 national pilot. Final report. Slough: NFER.
Sainsbury, M., ed. 1996. SATs the inside story: the development of the first national assessments for seven-year-olds, 1989–1995. Slough: NFER.
School Curriculum and Assessment Authority. 1995. Evaluation of the external marking of National Curriculum tests in 1995. London: SCAA.
Shorrocks, D., S. Daniels, L. Frobisher, N. Nelson, A. Waterson, and J. Bell. 1992. The evaluation of national curriculum assessment at Key Stage 1 (ENCA 1 Project): Final report. London: SEAC.
Shorrocks-Taylor, D. 1999. National testing: Past, present and future. Leicester: BPS Books.
Shorrocks-Taylor, D., B. Swinnerton, H. Ensaff, M. Hargreaves, M. Homer, G. Pell, P. Pool, and J. Threlfall. 2004. Evaluation of the trial assessment arrangements for Key Stage 1. Report to QCA. Leeds: University of Leeds. http://www.qca.org.uk/libraryAssets/media/8994_leeds_uni__ks1_eval_report.pdf (accessed 31 July 2008).
Statistics Commission. 2005. Measuring standards in English primary schools. Report no. 23. London: Statistics Commission.
Stobart, G. 2009. Determining validity in national curriculum assessment (Special Issue: National curriculum assessment in England: How well has it worked?). Educational Research 51, no. 2: 161–79.
Thatcher, M. 1993. The Downing Street years. London: HarperCollins.
Tymms, P. 2004. Are standards rising in English primary schools? British Educational Research Journal 30, no. 4: 477–94.
Tymms, P., and C. Merrell. 2007. Standards and quality in English primary schools over time: the national evidence. Primary Review Research Survey 4/1. Cambridge: Esmee Fairbairn Foundation. http://www.primaryreview.org.uk/Downloads/Int_Reps/2.Standards_quality_assessment/Primary_Review_Tymms_Merrell_4-1_report_Standards_Quality_071102.pdf (accessed 31 July 2008).
Whetton, C., and P. Newton. 2002. An evaluation of on-line marking. Paper presented at the 28th International Association for Educational Assessment Conference, September 3, in Hong Kong SAR, China.
Wyse, D., E. McCreery, and H. Torrance. 2008. The trajectory and impact of national reform: curriculum and assessment in English primary schools. Primary review research survey 3/2. Cambridge: Esmee Fairbairn Foundation. http://www.primaryreview.org.uk/Downloads/Int_Reps/7.Governance-finance-reform/RS_3-2_report_Curriculum_assessment_reform_080229.pdf (accessed 31 July 2008).