The Epistemology of Measurement:
A Model-Based Account
by
Eran Tal
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Philosophy
University of Toronto
© Copyright by Eran Tal 2012
The Epistemology of Measurement: A Model-Based Account
Eran Tal, Doctor of Philosophy
Department of Philosophy, University of Toronto, 2012
Thesis abstract
Measurement is an indispensable part of physical science as well as of commerce,
industry, and daily life. Measuring activities appear unproblematic when performed with
familiar instruments such as thermometers and clocks, but a closer examination reveals a
host of epistemological questions, including:
1. How is it possible to tell whether an instrument measures the quantity it is
intended to?
2. What do claims to measurement accuracy amount to, and how might such
claims be justified?
3. When is disagreement among instruments a sign of error, and when does it
imply that instruments measure different quantities?
Currently, these questions are almost completely ignored by philosophers of science,
who view them as methodological concerns to be settled by scientists. This dissertation
shows that these questions are not only philosophically worthy, but that their exploration
has the potential to challenge fundamental assumptions in philosophy of science, including
the distinction between measurement and prediction.
The thesis outlines a model-based epistemology of physical measurement and uses it to
address the questions above. To measure, I argue, is to estimate the value of a parameter in
an idealized model of a physical process. Such estimation involves inference from the final
state (‘indication’) of a process to the value range of a parameter (‘outcome’) in light of
theoretical and statistical assumptions. Idealizations are necessary preconditions for the
possibility of justifying such inferences. Similarly, claims to accuracy, error and quantity
individuation can only be adjudicated against the background of an idealized representation
of the measurement process.
Chapters 1-3 develop this framework and use it to analyze the inferential structure of
standardization procedures performed by contemporary standardization bureaus.
Standardizing time, for example, is a matter of constructing idealized models of multiple
atomic clocks in a way that allows consistent estimates of duration to be inferred from clock
indications. Chapter 4 shows that calibration is a special sort of modeling activity, i.e. the
activity of constructing and testing models of measurement processes. Contrary to
contemporary philosophical views, the accuracy of measurement outcomes is properly
evaluated by comparing model predictions to each other, rather than by comparing
observations.
Acknowledgements
In the course of writing this dissertation I have benefited time and again from the
knowledge, advice and support of teachers, colleagues and friends. I am deeply indebted to
Margie Morrison for being everything a supervisor should be and more: generous with her
time and precise in her feedback, unfailingly responsive and relentlessly committed to my
success. I thank Ian Hacking for his constant encouragement, for never ceasing to challenge
me, and for teaching me to respect the science and scientists of whom I write. I owe many
thanks to Anjan Chakravartty, who commented on several early proposals and many sketchy
drafts; this thesis owes its clarity to his meticulous feedback. My teaching mentor, Jim
Brown, has been a constant source of friendly advice on all academic matters since my very
first day in Toronto, for which I am very grateful.
In addition to my formal advisors, I have been fortunate enough to meet faculty members in
other institutions who have taken an active interest in my work. I am grateful to Stephan
Hartmann for the three wonderful months I spent as a visiting researcher at Tilburg
University; to Allan Franklin for ongoing feedback and assistance during my visit to the
University of Colorado; to Paul Teller for insightful and detailed comments on virtually the
entire dissertation; and to Marcel Boumans, Wendy Parker, Léna Soler, Alfred Nordmann
and Leah McClimans for informal mentorship and fruitful research collaborations.
Many other colleagues and friends provided useful comments on this thesis at various stages
of writing, of which I can only mention a few. I owe thanks to Giora Hon, Paul Humphreys,
Michela Massimi, Luca Mari, Carlo Martini, Ave Mets, Boaz Miller, Mary Morgan, Thomas
Müller, John Norton, Isaac Record, Jan Sprenger, Jacob Stegenga, Jonathan Weisberg,
Michael Weisberg, Eric Winsberg, and Jim Woodward, among many others.
I am especially thankful to Hasok Chang for writing a thoughtful and detailed appraisal of
this dissertation, and to Joseph Berkovitz and Denis Walsh for serving on my examination
committee.
The work presented here depended on numerous physicists who were kind enough to meet
with me, show me around their labs and answer my often naive questions. I am grateful to
members of the Time and Frequency Division at the US National Institute of Standards and
Technology (NIST) and JILA labs in Boulder, Colorado for their helpful cooperation. The
long hours I spent in conversation with Judah Levine introduced me to the fascinating world
of atomic clocks and ultimately gave rise to the central case studies reported in this thesis.
David Wineland’s invitation to visit the laboratories of the Ion Storage Group at NIST in
summer 2009 resulted in a wealth of materials for this dissertation. I am also indebted to
Eric Cornell, Till Rosenband, Scott Diddams, Tom Parker and Tom Heavner for their time
and patience in answering my questions. Special thanks go to Chris Ellenor and Rockson
Chang, who, as graduate students in Aephraim Steinberg’s laboratory in Toronto, spent
countless hours explaining to me the technicalities of Bose-Einstein Condensation.
My research for this dissertation was supported by several grants, including three Ontario
Graduate Scholarships, a Chancellor Jackman Graduate Fellowship in the Humanities, a
School of Graduate Studies Travel Grant (the latter two from the University of Toronto),
and a Junior Visiting Fellowship at Tilburg University.
I am indebted to Gideon Freudenthal, my MA thesis supervisor, whose enthusiasm for
teaching and attention to detail inspired me to pursue a career in philosophy.
My mother, Ruth Tal, has been extremely supportive and encouraging throughout my
graduate studies. I deeply thank her for enduring my infrequent visits home and the
occasional cold Toronto winter.
Finally, to my partner, Cheryl Dipede, for suffering through my long hours of study with
only support and love, and for obligingly jumping into the unknown with me, thanks for
being you.
Table of Contents
Introduction ... 1
    1. Measurement and knowledge ... 1
    2. The epistemology of measurement ... 3
    3. Three epistemological problems ... 5
        The problem of coordination ... 8
        The problem of accuracy ... 11
        The problem of quantity individuation ... 12
        Epistemic entanglement ... 14
    4. The challenge from practice ... 15
    5. The model-based account ... 17
    6. Methodology ... 21
    7. Plan of thesis ... 24
1. How Accurate is the Standard Second? ... 26
    1.1. Introduction ... 26
    1.2. Five notions of measurement accuracy ... 29
    1.3. The multiple realizability of unit definitions ... 33
    1.4. Uncertainty and de-idealization ... 37
    1.5. A robustness condition for accuracy ... 40
    1.6. Future definitions of the second ... 44
    1.7. Implications and conclusions ... 46
2. Systematic Error and the Problem of Quantity Individuation ... 48
    2.1. Introduction ... 48
    2.2. The problem of quantity individuation ... 51
        2.2.1. Agreement and error ... 51
        2.2.2. The model-relativity of systematic error ... 55
        2.2.3. Establishing agreement: a threefold condition ... 59
        2.2.4. Underdetermination ... 62
        2.2.5. Conceptual vs. practical consequences ... 64
    2.3. The shortcomings of foundationalism ... 67
        2.3.1. Bridgman’s operationalism ... 68
        2.3.2. Ellis’ conventionalism ... 70
        2.3.3. Representational Theory of Measurement ... 73
    2.4. A model-based account of measurement ... 78
        2.4.1. General outline ... 78
        2.4.2. Conceptual quantity individuation ... 83
        2.4.3. Practical quantity individuation ... 88
    2.5. Conclusion: error as a conceptual tool ... 91
3. Making Time: A Study in the Epistemology of Standardization ... 93
    3.1. Introduction ... 93
    3.2. Making time universal ... 99
        3.2.1. Stability and accuracy ... 99
        3.2.2. A plethora of clocks ... 103
        3.2.3. Bootstrapping reliability ... 106
        3.2.4. Divergent standards ... 108
        3.2.5. The leap second ... 111
    3.3. The two faces of stability ... 112
        3.3.1. An explanatory challenge ... 112
        3.3.2. Conventionalist explanations ... 113
        3.3.3. Constructivist explanations ... 118
    3.4. Models and coordination ... 123
        3.4.1. A third alternative ... 123
        3.4.2. Mediation, legislation, and models ... 126
        3.4.3. Coordinative freedom ... 130
    3.5. Conclusions ... 136
4. Calibration: Modeling the Measurement Process ... 138
    4.1. Introduction ... 138
    4.2. The products of calibration ... 142
        4.2.1. Metrological definition ... 142
        4.2.2. Indications vs. outcomes ... 143
        4.2.3. Forward and backward calibration functions ... 146
    4.3. Black-box calibration ... 148
    4.4. White-box calibration ... 151
        4.4.1. Model construction ... 151
        4.4.2. Uncertainty estimation ... 154
        4.4.3. Projection ... 158
        4.4.4. Predictability, not just correlation ... 160
    4.5. The role of standards in calibration ... 164
        4.5.1. Why standards? ... 164
        4.5.2. Two-way white-box calibration ... 165
        4.5.3. Calibration without metrological standards ... 168
        4.5.4. A global perspective ... 170
    4.6. From predictive uncertainty to measurement accuracy ... 174
    4.7. Conclusions ... 177
Epilogue ... 178
Bibliography ... 181
List of Tables
Table 1.1: Comparison of uncertainty budgets of aluminum (Al) and mercury (Hg) optical atomic clocks ... 45
Table 3.1: Excerpt from Circular-T ... 104
Table 4.1: Uncertainty budget for a torsion pendulum measurement of G, the Newtonian gravitational constant ... 156
Table 4.2: Type-B uncertainty budget for NIST-F1, the US primary frequency standard ... 166
List of Figures
Figure 3.1: A simplified hierarchy of approximations among model parameters in contemporary timekeeping ... 129
Figure 4.1: Modules and parameters involved in a white-box calibration of a simple caliper ... 153
Figure 4.2: A simplified diagram of a round-robin calibration scheme ... 169
Introduction
I often say that when you can measure what you are speaking about and express
it in numbers you know something about it; but when you cannot measure it,
when you cannot express it in numbers, your knowledge is of a meagre and
unsatisfactory kind […].
– William Thomson, Lord Kelvin (1891, 80)
1. Measurement and knowledge
Measurement is commonly seen as a privileged source of scientific knowledge. Unlike
qualitative observation, measurement enables the expression of empirical claims in
mathematical form and hence makes possible an exact description of nature. Lord Kelvin’s
famous remark expresses high esteem for measurement for this same reason. Today, in an
age when thermometers and ammeters produce stable measurement outcomes on familiar
scales, Kelvin’s remark may seem superfluous. How else could one gain reliable knowledge
of temperature and electric current other than through measurement? But the quantities
called ‘temperature’ and ‘current’ as well as the instruments that measure them have long
histories during which it was far from clear what was being measured and how – histories in
which Kelvin himself played important roles1.
These early struggles to find principled relations between the indications of material
instruments and values of abstract quantities illustrate the dual nature of measurement. On
the one hand, measurement involves the design, execution and observation of a concrete
physical process. On the other hand, the outcome of a measurement is a knowledge claim
formulated in terms of some abstract and universal concept – e.g. mass, current, length or
duration. How, and under what conditions, are such knowledge claims warranted on the
basis of material operations?
Answering this last question is crucial to understanding how measurement produces
knowledge. And yet contemporary philosophy of measurement offers little by way of an
answer. Epistemological concerns about measurement were briefly popular in the 1920s
(Campbell 1920, Bridgman 1927, Reichenbach [1927] 1958) and again in the 1960s (Carnap
[1966] 1995, Ellis 1966), but have otherwise remained in the background of philosophical
discussion. Until less than a decade ago, the philosophical literature on measurement focused
on either the metaphysics of quantities (Swoyer 1987, Michell 1994) or the mathematical
structure of measurement scales. The Representational Theory of Measurement (Krantz et al.
1971), for example, confined itself to a discussion of structural mappings between empirical
and quantitative domains and neglected the possibility of telling what, and how accurately,
such mappings measure. It is only in the last several years that a new wave of philosophical
writings about the epistemology of measurement has appeared (most notably Chang 2004,
Boumans 2006, 2007 and van Fraassen 2008, Ch. 5-7). Partly drawing on these recent
achievements, this thesis will offer a novel systematic account of the ways in which
measurement produces knowledge.
1 See Chang (2004, 173-186) and Gooday (2004, 2-9).
2. The epistemology of measurement
The epistemology of measurement, as envisioned in this dissertation, is a subfield of
philosophy concerned with the relationships between measurement and knowledge. Central
topics that fall under its purview are the conditions under which measurement produces
knowledge; the content, scope, justification and limits of such knowledge; the reasons why
particular methodologies of measurement and standardization succeed or fail in supporting
particular knowledge claims; and the relationships between measurement and other
knowledge-producing activities such as observation, theorizing, experimentation, modeling
and calculation. The pursuit of research into these topics is motivated not only by the need
to clarify the epistemic functions of measurement, but also by the prospects of contributing
to other areas of philosophical discussion concerning e.g. reliability, evidence, causality,
objectivity, representation and information.
As measurement is not exclusively a scientific activity – it plays vital roles in
engineering, medicine, commerce, public policy and everyday life – the epistemology of
measurement is not simply a specialized branch of philosophy of science. Instead, the
epistemology of measurement is a subfield of philosophy that draws on the tools and
concepts of traditional epistemology, philosophy of science, philosophy of language,
philosophy of technology and philosophy of mind, among other subfields. It is also a
multidisciplinary subfield, ultimately engaging with measurement techniques from a variety of
disciplines as well as with the histories and sociologies of those disciplines.
The goal of providing a comprehensive epistemological theory of measurement is
beyond the scope of a single doctoral dissertation. This thesis is cautiously titled ‘account’
rather than ‘theory’ in order to signal a more modest intention: to argue for the plausibility
of a particular approach to the epistemology of measurement by demonstrating its strengths
in a specific domain. I call my approach ‘model-based’ because it tackles epistemological
challenges by appealing to abstract and idealized models of measurement processes. As I will
explain below, this thesis constitutes the first systematic attempt to bring insights from the
burgeoning literature on the philosophy of scientific modeling to bear on traditional
problems in the philosophy of measurement. The specific domain I will focus on is physical
metrology, officially defined as “the science of measurement and its application”2. Metrologists
are the physicists and engineers who design and standardize measuring instruments for use
in scientific and commercial applications, and often work at standardization bureaus or
specially accredited laboratories.
The immediate aim of this dissertation, then, is to show that a model-based approach
to measurement successfully solves certain epistemological challenges in the domain of
physical metrology. By achieving this aim, a more far-reaching goal will also be
accomplished, namely, a demonstration of the importance of research into the epistemology
of measurement and of the promise held by model-based approaches for further research in
this area.
2 JCGM 2008, 2.2.
The epistemological challenges addressed in this thesis may be divided into two kinds.
The first kind consists of abstract and general epistemological problems that pertain to any
sort of measurement, whether physical or nonphysical (e.g. of social or mental quantities). I
will address three such problems: the problem of coordination, the problem of accuracy, and
the problem of quantity individuation. These problems will be introduced in the next
section. The second kind of epistemological challenge consists of problems that are specific
to physical metrology. These problems arise from the need to explain the efficacy of
metrological methods for solving problems of the first sort – for example, the efficacy of
metrological uncertainty evaluations in overcoming the problem of accuracy. After
discussing these ‘challenges from practice’, I will introduce the model-based account,
explicate my methodology and outline the plan of this thesis.
3. Three epistemological problems
This thesis will address three general epistemological problems related to
measurement, which arise when one attempts to answer the following three questions:
1. Given a procedure P and a quantity Q, how is it possible to tell whether P
measures Q?
2. Assuming that procedure P measures quantity Q, how is it possible to tell how
accurately P measures Q?
3. Assuming that P and P′ are two measuring procedures, how is it possible to
tell whether P and P′ measure the same quantity?
Each of these three questions pertains to the possibility of obtaining knowledge of
some sort about the relationship between measuring procedures and the quantities they
measure. The sort of possibility I am interested in is not a general metaphysical or epistemic
one – I do not consider the existence of the world or the veridical character of perception as
relevant answers to the questions above. Rather, I will be interested in possibility in the
practical, technological sense. What is technologically possible is what humans can do with
the limited cognitive and material resources they have at their disposal and within reasonable
time3. Hence to qualify as an adequate answer to the questions above, a condition of
possibility must be cognitively accessible through one or more empirical tests that humans may
reasonably be expected to perform. For example, an adequate answer to the first question
would specify the sort of evidence scientists are required to collect in order to test whether
an instrument is a thermometer – i.e. whether or not it measures temperature – as well as
general considerations that apply to the analysis of this evidence.
An obvious worry is that such conditions are too specific and can only be supplied on
a case-by-case basis. This worry would no doubt be justified if one were to seek particular
test specifications or ‘experimental recipes’ in response to the questions above. No single
test, nor even a small set of tests, exists that can be applied universally to any measuring
procedure and any quantity to yield satisfactory answers to the questions above. But this
worry is founded on an overly narrow interpretation of the questions’ scope. The conditions
of possibility sought by the questions above are not empirical test specifications but only
general formal constraints on such specifications. These formal constraints, as we shall see,
pertain to the structure of inferences involved in such tests and to general representational
preconditions for performing them. Of course, it is not guaranteed in advance that even
general constraints of this sort exist. If they do not, knowledge claims about measurement,
accuracy and quantity individuation would have no unifying grounds. Yet at least in the case
of physical quantities, I will show that a shared inferential and representational structure
indeed underlies the possibility of knowing what, and how accurately, one is measuring.
3 For an elaboration of the notion of technological possibility see Record (2011, Ch. 2).
Another, sceptical sort of worry is that the questions above may have no answer at all,
because it may in fact be impossible to know whether and how accurately any given
procedure measures any quantity. I take this worry to be indicative of a failure in
philosophical methodology rather than an expression of a cautious approach to the
limitations of human knowledge. The terms “measurement”, “quantity” and “accuracy”
already have stable (though not necessarily unique) meanings set by their usage in scientific
practice. Claims to measurement, accuracy and quantity individuation are commonly made in
the sciences based on these stable meanings. The job of epistemologists of measurement, as
envisioned in this thesis, is to clarify these meanings and make sense of scientific claims
made in light of such meanings. In some cases the epistemologist may conclude that a
particular scientific claim is unfounded or that a particular scientific method is unreliable.
But the conclusion that all claims to measurement are unfounded is only possible if
philosophers create perverse new meanings for these terms. For example, the idea that
measurement accuracy is unknowable in principle cannot be seriously entertained unless the
meaning of “accuracy” is detached from the way practicing metrologists use this term, as will
be shown in Chapter 1. I will elaborate further on the interplay between descriptive and
normative aspects of the epistemology of measurement when I discuss my methodology
below.
As mentioned, the attempt to answer the three questions above gives rise to three
epistemological problems: the problem of coordination, the problem of accuracy and the
problem of quantity individuation, respectively. The next three subsections will introduce
these problems, and the fourth subsection will discuss their mutual entanglement.
The problem of coordination
How can one tell whether a given empirical procedure measures a given quantity? For
example, how can one tell that an instrument is a thermometer, i.e. that the procedure of its
use results in estimates of temperature? The answer is clear enough if one is allowed to
presuppose, as scientists do today, an accepted theory of temperature along with accepted
standards for measuring temperature. The epistemological conundrum arises when one
attempts to explain the possibility of establishing such theories and standards in the first
place. To establish a theory of temperature one has to be able to test its predictions
empirically, a task which requires a reliable method of measuring temperature; but
establishing such a method requires prior knowledge of how temperature is related to other
quantities, e.g. volume or pressure, and this can only be settled by an empirically tested
theory of temperature. It appears to be impossible to coordinate the abstract notion of
temperature to any concrete method of measuring temperature without begging the
question.
The problem of coordination was discussed by Mach ([1896] 1966) in his analysis of
temperature measurement and by Poincaré ([1898] 1958) in relation to the measurement of
space and time. Both authors took the view that the choice of coordinative principles is
arbitrary and motivated by considerations of simplicity. Which substance is taken to expand
uniformly with temperature, and which kind of clock is taken to ‘tick’ at equal time intervals,
are choices based on convenience rather than observation. The conventionalist solution was
later generalized by Reichenbach ([1927] 1958), Carnap ([1966] 1995) and Ellis (1966), who
understood such coordinative principles (or ‘correspondence rules’) as a priori definitions
that are in no need of empirical verification. Rather than statements of fact, such principles
of coordination were viewed as semantic preconditions for the possibility of measurement.
However, conventionalists maintained that, unlike ‘ordinary’ conceptual definitions,
coordinative definitions do not fully determine the meaning of a quantity concept but only
regulate its use. For example, what counts as an accurate measurement of time depends on
which type of clock is chosen to regulate the application of the notion of temporal
uniformity. But the extension of the notion of uniformity is not limited to that particular
type of clock. Other types of clock may be used to measure time, and their accuracy is
evaluated by empirical comparison to the conventionally chosen standard4.
4 See, for example, Carnap on the periodicity of clocks ([1966] 1995, 84). For a discussion of the differences between operationalism and conventionalism see Chang and Cartwright (2008, 368).
Another approach to the problem of coordination, closely aligned with but distinct
from conventionalism, was defended by Bridgman (1927). Bridgman’s initial proposal was to
define a quantity concept directly by the operation of its measurement, so that strictly
speaking two different types of operation necessarily measure different quantities. The
operationalist solution is more radical than conventionalism, as it reduces the meaning of a
quantity concept to its operational definition. Bridgman motivated this approach by the need
to exercise caution when applying what appears to be the same quantity concept across
different domains. Bridgman later modified his view in response to various criticisms and no
longer viewed operationalism as a comprehensive theory of meaning (Bridgman 1959,
Chang 2009, 2.1).
A new strand of writing on the problem of coordination has emerged in the last
decade, consisting most notably of the works of Chang (2004) and van Fraassen (2008, Ch.
5). These works take a historical-contextual and coherentist approach to the problem. Rather
than attempt a solution from first principles, these writers appeal to considerations of
coherence and consistency among different elements of scientific practice. The process of
theory-construction and standardization is seen as mutual and iterative, with each iteration
respecting existing traditions while at the same time correcting them. At each such iteration
the quantity concept is re-coordinated to a more robust set of standards, which in turn
allows theoretical predictions to be tested more accurately, etc. The challenge for these
writers is not to find a vantage point from which coordination is deemed rational a priori,
but to trace the inferential and material apparatuses responsible for the mutual refinement of
theory and measurement in any specific case. Hence they reject the traditional question:
‘what is the general solution to the problem of coordination?’ in favour of historically
situated, local investigations.
As will become clear, my approach to the problem of coordination continues the
historical-contextual and coherentist trend in recent scholarship, but at the same time seeks
to specify general formal features common to successful solutions to this problem. Rather
than abandon traditional approaches to the problem altogether, my aim will be to shed new
light on, and ultimately improve upon, conventionalist and operationalist attempts to solve
the problem of coordination. To this end I will provide a novel account of what it means to
coordinate quantity concepts to physical operations – an account in which coordination is
understood as a process rather than a static definition – and clarify the conventional and
empirical aspects of this process.
The problem of accuracy
Even if one can safely assume that a given procedure measures the quantity it is
intended to, a second problem arises when one tries to evaluate the accuracy of that
procedure. Quantities such as length, duration and temperature, insofar as they are
represented by non-integer (e.g. rational or real) numbers, cannot be measured with
complete accuracy. Even measurements of integer-valued quantities, such as the number of
alpha particles emitted in radioactive decay, often involve uncertainties. The accuracy
of measurements of such quantities cannot, therefore, be evaluated by reference to exact
values but only by comparing uncertain estimates to each other. Such comparisons by their
very nature cannot determine the extent of error associated with any single estimate but only
overall mutual compatibility among estimates. Hence multiple ways of distributing errors
among estimates are possible that are all consistent with the evidence gathered through
comparisons. It seems that claims to accuracy are intrinsically underdetermined by any
possible evidence.5
Many of the authors who have discussed the problem of coordination appear to have
also identified the problem of accuracy, although they have not always distinguished the two
very clearly. Often, as in the cases of Mach, Ellis and Carnap, they naively believed that
fixing a measurement standard in an arbitrary manner is sufficient to solve both problems at
once. However, measurement standards are physical instruments whose construction,
maintenance, operation and comparison suffer from uncertainties just like those of other
instruments. As I will show, the absolute accuracy of measurement standards is nothing but
a myth that obscures the complexity behind the problem of accuracy. Indeed, I will argue
that the role played by standards in the evaluation of measurement accuracy has so far been
grossly misunderstood by philosophers. Once the epistemic role of standards is clarified,
new and important insights emerge not only with respect to the proper solution to the
problem of accuracy but also with respect to the other two problems.
The problem of quantity individuation
When discussing the previous two problems I implicitly assumed that it is possible to
tell whether multiple measuring procedures, compared to each other either synchronically or
diachronically, measure the same quantity. But this assumption quickly leads to another
underdetermination problem, which I call the ‘problem of quantity individuation.’ Even
5 See also Kyburg (1984, 183).
when two different procedures are thought to measure the same quantity, their outcomes
rarely exactly coincide under similar conditions. Therefore when the outcomes of two
procedures appear to disagree with each other two kinds of explanation are open to
scientists: either one (or both) of the procedures are inaccurate, or the two procedures
measure different quantities.6 But any empirical test that may be brought to bear on this
dilemma necessarily presupposes additional facts about agreement or disagreement among
measurement outcomes and merely duplicates the problem. Much like claims about
accuracy, claims about quantity individuation are underdetermined by any possible evidence.
As Chapter 2 will make clear, existing philosophical accounts of quantity individuation
do not fully acknowledge the import of the problem. Bridgman and Ellis, for example, both
acknowledge that claims to quantity individuation are underdetermined by facts about
agreement and disagreement among measuring instruments. And yet they fail to notice that
facts about agreement and disagreement among measuring instruments are themselves
underdetermined by the indications of those instruments. Once this additional level of
underdetermination is properly appreciated, Bridgman and Ellis’ proposed criteria of
quantity individuation are exposed as question-begging. A proper solution to the problem of
quantity individuation, I will argue, is possible only if one takes into account its
entanglement with the first two problems.
6 This second option may be further subdivided into sub-options. The two procedures may be measuring different quantity tokens of the same type, e.g. lengths of different objects, or two different types of quantity altogether, e.g. length and area.
Epistemic entanglement
Though conceptually distinct, I will argue that the three problems just mentioned are
epistemically entangled, i.e. that they cannot be solved independently of one another.
Specifically, I will show that (i) it is impossible to test whether a given procedure P measures
a given quantity Q without at the same time testing how accurately procedure P would
measure quantity Q; (ii) it is impossible to test how accurately procedure P would measure
quantity Q without comparing it to some other procedure P′ that is supposed to measure Q;
and (iii) it is impossible to test whether P and P′ measure the same quantity without at the
same time testing whether they measure some given quantity, e.g. Q. Note that these
‘impossibility theses’ are epistemic rather than logical. For example, it is logically possible to
know that two procedures measure the same quantity without knowing which quantity they
measure.7 Nevertheless, it is epistemically impossible to test whether two procedures
measure the same quantity without making substantive assumptions about the quantity they
are supposed to measure.
The extent and consequences of this epistemic entanglement have hitherto remained
unrecognized by philosophers, despite the fact that some of the problems themselves have
been widely acknowledged for a long time. The model-based account presented here is the
first epistemology of measurement to clarify how it is possible in general to solve all three
problems simultaneously without getting caught in a vicious circle.
7 The opposite is not the case, of course: one cannot (logically speaking) know which quantities two procedures measure without knowing whether they measure the same quantity. Questions 1 and 3 are therefore logically related, but not logically equivalent.
4. The challenge from practice
Apart from solving abstract and general problems like those discussed in the previous
section, a central challenge for the epistemology of measurement is to make sense of specific
measurement methods employed in particular disciplines. Indeed, it would be of little value
to suggest a solution to the abstract problems that has no bearing on scientific practice, as
such a solution would not be able to clarify whether and how accepted measurement methods
actually overcome these problems. The ‘challenge from practice’, then, is to shed light on the
epistemic efficacy of concrete methodologies of measurement and standardization. How do
such methods overcome the three general epistemological problems discussed above? As
already mentioned, this thesis will focus on the standardization of physical measuring
instruments. Physical metrology involves a variety of methods for instrument comparison,
error detection and correction, uncertainty evaluation and calibration. These methods
employ theoretical and statistical tools as well as techniques of experimental manipulation
and control. A central desideratum for the plausibility of the model-based account will be its
ability to explain how, and under what conditions, these methods support knowledge claims
about coordination, accuracy and quantity individuation.
As my focus will be on physical metrology, I will pay special attention to the
methodological guidelines developed by practitioners in that field. In particular, I will
frequently refer to two documents published in 2008 by the Joint Committee for Guides in
Metrology (JCGM), a committee that represents eight leading international standardization
bodies8. The first document is the International Vocabulary of Metrology – Basic and General
Concepts and Associated Terms (VIM), 3rd edition (JCGM 2008).9 This document contains
definitions and clarificatory remarks for dozens of key concepts in metrology such as
calibration, measurement accuracy, measurement precision and measurement standard.
These definitions shed light on the way practitioners understand these concepts and on their
underlying (and sometimes conflicting) epistemic and metaphysical commitments. The
second document is titled Evaluation of Measurement Data — Guide to the Expression of
Uncertainty in Measurement (GUM), 1st edition (JCGM 2008a). This document provides
detailed guidelines for evaluating measurement uncertainties and for comparing the results
of different measurements. Together these two documents portray a methodological picture
of metrology in which abstract and idealized representations of measurement processes play
a central role. However, being geared towards regulating practice, these documents do not
explicitly analyze the presuppositions underlying this methodological picture nor its efficacy
for overcoming general epistemological conundrums that are of interest to philosophers. It
is this gap between methodology and epistemology that the model-based account of
measurement is intended to fill.
8 The JCGM is composed of representatives from the International Bureau of Weights and Measures (BIPM), the International Electrotechnical Commission (IEC), the International Federation of Clinical Chemistry and Laboratory Medicine (IFCC), the International Laboratory Accreditation Cooperation (ILAC), the International Organization for Standardization (ISO), the International Union of Pure and Applied Chemistry (IUPAC), the International Union of Pure and Applied Physics (IUPAP) and the International Organization of Legal Metrology (OIML).
9 A new version of the 3rd edition of the VIM with minor changes was published in early 2012. My discussion in this thesis applies equally to this new version.
5. The model-based account
According to the model-based account, a necessary precondition for the possibility of
measuring is the specification of an abstract and idealized model of the measurement process. To
measure a physical quantity is to make coherent and consistent inferences from the final
state(s) of a physical process to value(s) of a parameter in the model. Prior to the
subsumption of a process under some idealized assumptions, it is impossible to ground such
inferences and hence impossible to obtain a measurement outcome. Rather than be given by
observation, measurement outcomes are sensitive to the assumptions with which a
measurement process is modelled and may change when these assumptions change. The
same holds true for estimates of measurement uncertainty, accuracy and error, as well as for
judgements about agreement and disagreement among measurement outcomes – all are
relative to the assumptions under which the relevant measurement processes are modelled.
My conception of the nature and functions of models follows closely the views
expressed in Morrison and Morgan (1999), Morrison (1999), Cartwright et al. (1995) and
Cartwright (1999). I take a scientific model to be an abstract representation of some local
phenomenon, a representation that is used to predict and explain aspects of that
phenomenon. A model is constructed out of assumptions about the ‘target’ phenomenon
being represented. These assumptions may include laws and principles from one or more
theories, empirical generalizations from available data, statistical assumptions about the data,
and other local (and sometimes ad hoc) simplifying assumptions about the phenomenon of
interest. The specialized character of models allows them to function autonomously from
the theories that contributed to their construction, and to mediate between the highly
abstract assumptions of theory and concrete phenomena. I view models as instruments that
are more or less useful for purposes of prediction, explanation, experimental design and
intervention, rather than as descriptions that are true or false.
Though not committed to any particular view on how models represent the world, the
model-based account does not require models to mirror the structure of their target systems
in order to be successful representational instruments. My framework therefore differs from
the ‘semantic’ view, which takes models to be set-theoretical relational structures that are
isomorphic to relations among objects in the target domain (Suppes 1960, van Fraassen
1980, 41-6). The model-based account is also permissive with respect to the ontology of
models, and apart from assuming that models are abstract constructs I do not presuppose
any particular view concerning their nature (e.g. abstract entities, mathematical objects,
fictions). I do, however, take models to be non-linguistic entities and hence different from
the equations used to express their assumptions and consequences.10
The epistemic functions of models have received far less attention in the context of
measurement than in other contexts where models are used to produce knowledge, e.g.
theory construction, prediction, explanation, experimentation and simulation. An exception
to this general neglect is the use of models for measurement in economics, a topic about
which philosophers have gained valuable insights in recent years (Boumans 2005, 2006,
2007; Morgan 2007). The Representational Theory of Measurement (Krantz et al 1971)
appeals to models in the set-theoretical sense to elucidate the adequacy of different types of
scales, but completely neglects epistemic questions concerning coordination, accuracy and
quantity individuation. This thesis will focus on the epistemic functions of models in
10 On this last point my terminology is at odds with that of the VIM, which defines a measurement model as a set of equations. Cf. JCGM 2008, 2.48 “Measurement Model”, p. 32.
physical measurement, a topic on which relatively little has been written, and to date no
systematic account has been offered.11
The models I will discuss represent measurement processes. Such processes have physical
and symbolic aspects. The physical aspect of a measurement process, broadly construed,
includes interactions between a measuring instrument, one or more measured samples, the
environment and human operators. The symbolic aspect includes data processing operations
such as averaging, data reduction and error correction. The primary function of models of
measurement processes is to represent the final states – or ‘indications’ – of the process in
terms of values of the measured quantity. For example, the primary function of a model of a
cesium fountain clock is to represent the output frequency of the clock (the frequency of its
‘ticks’) in terms of the ideal frequency associated with a specific hyperfine transition in
cesium-133. To do this, the model of the clock must incorporate theoretical and statistical
assumptions about the working of the clock and its interactions with the cesium sample and
the environment, as well as about the processing of the output frequency signal.
A measurement procedure is a measurement process as represented under a particular set
of modeling assumptions. Hence multiple procedures may be instantiated on the basis of the
same measurement process when the latter is represented with different models.12 For
example, the same interactions among various parts of a cesium fountain clock and its
11 But see important contributions to this topic by Morrison (2009) and Frigerio, Giordani and Mari (2010).
12 Here too I have chosen to slightly deviate from the terminology of the VIM, which defines a measurement procedure as a description of a measurement process that is based on a measurement model (JCGM 2008, 2.6, p. 18). I use the term, by contrast, to denote a measurement process as represented by a measurement model. The difference is that in the VIM definition a procedure does not itself measure but only provides instructions on how to measure, whereas in my definition a procedure measures.
environment may instantiate different procedures for measuring time when modelled with
different assumptions.
According to the model-based account, knowledge claims about coordination,
accuracy and quantity individuation are properly ascribable to measurement procedures rather
than to measurement processes. That is, such knowledge claims presuppose that the
measurement process in question is already subsumed under specific idealized assumptions,
and may therefore be judged as true or false only relative to those assumptions. The central
reason for this model-relativity is that prior to the subsumption of a measurement process
under a model it is impossible to warrant objective claims about the outcomes of
measurement, that is, claims that reasonably ascribe the outcome to the object being measured
rather than to idiosyncrasies of the procedure. This will be explained in detail in Chapter 2.
As I will argue, the model-based account meets both the abstract and practice-based
challenges I have discussed. Once the inferential grounds of measurement claims are
relativized to a representational context, it becomes clear how all three epistemic problems
mentioned above may be solved simultaneously. Moreover, it becomes clear how
contemporary metrological methods of standardization, calibration and uncertainty
evaluation are able to solve these problems, and what practical considerations and trade-offs
are involved in the application of such methods. Finally, it becomes clear why measurement
outcomes retain their objective validity outside the representational context in which they are
obtained, thereby avoiding problems of incommensurability across different measuring
procedures.
In providing a model-based epistemology of measurement, I intend to offer neither a
critique nor an endorsement of metaphysical realism with respect to measurable quantities.
My account remains agnostic with respect to metaphysics and pertains to measurement
solely as an epistemic activity, i.e. to the inferences and assumptions that make it possible to
warrant knowledge claims by operating measuring instruments. For example, nothing in my
account depends on whether or not ratios of mass (or length, or duration) exist mind-
independently. Indeed, in Chapters 1 and 4 I show that the problem of accuracy is solved in
exactly the same way regardless of whether one interprets measurement uncertainties as
deviations from true quantity values or as estimates of the degree of mutual consistency
among the consequences of different models. The question of realism with respect to
measurable quantities is therefore independent of the epistemology of measurement and
underdetermined by any evidence one can gather from the practice of measuring.
6. Methodology
As I have mentioned, the model-based account is designed to meet both general
epistemological challenges and challenges from practice. These two sorts of challenge may
be distinguished along the lines of a normative-descriptive divide and formulated as two
questions:
1. Normative question: what are some of the formal desiderata for an adequate
solution to the problems of coordination, accuracy and quantity
individuation?
2. Descriptive question: do the methods employed in physical metrology satisfy
these desiderata?
It is tempting to try to answer these questions separately – first by analyzing the
abstract problems and arriving at formal desiderata for their solution, and then by surveying
metrological methods for compatibility with these desiderata. But on a closer look it
becomes clear that these two questions cannot be answered completely independently of
each other. Much like the first-order problems, these questions are entangled. On the one
hand, overly strict normative desiderata would lead to the absurdity that no method can
resolve the problems (why this is an absurdity was discussed above). An example of an
overly strict desideratum is the requirement that measurement processes be perfectly
repeatable, a demand that is unattainable in practice. On the other hand, overly lenient
desiderata would run the risk of vindicating methods that practitioners regard as flawed.
Though not necessarily absurd, such cases, if they abounded, would eventually raise the worry
that one’s normative account fails to capture the problems that practitioners are trying to
solve. To avoid these two extremes, the epistemologist must be able to learn from practice what
counts as a good solution to an epistemological problem, yet do so without relinquishing the
normativity of her account.
These seemingly conflicting needs are fulfilled by a method I call ‘normative analysis of
exemplary cases’. I provide original and detailed case studies of metrological methods that
practitioners consider exemplary solutions to the general epistemological problems posed
above. Being exemplary solutions, they must also come out as successful solutions in my own
epistemological account, for otherwise I have failed to capture the problems that
metrologists are trying to solve. Note that this is not a license to believe everything
practitioners say, but merely a reasonable starting point for a normative analysis of practice.
In other words, this method reflects a commitment to learn from practitioners what their
problems are and assess their success in solving these problems rather than the preconceived
problems of philosophers.
For my main case studies I have chosen to concentrate on the standardization of time
and frequency, the most accurately and stably realized physical quantities in contemporary
metrology. In addition to a study of the metrological literature, I spent several weeks at the
laboratories of the Time and Frequency Division at the US National Institute of Standards
and Technology (NIST) in Boulder, Colorado. I conducted interviews with ten of the
Division’s scientists as well as with several other specialists at the University of Colorado’s
JILA labs. In these interviews I invited metrologists to reflect on the reasons why they make
certain knowledge claims about atomic clocks (e.g. about their accuracy, errors and
agreement), on the methods they use to validate such claims, and on problems or limitations
they encounter in applying these methods.
These materials then served as the basis for abstracting common presuppositions and
inference patterns that characterize metrological methods more generally. At the same time,
my superficial ‘enculturation’ into metrological life allowed me to reconceptualise the general
epistemological problems and assess their relevance to the concrete challenges of the
laboratory. These ongoing iterations of abstraction and concretization eventually led to a
stable set of desiderata that fit the exemplars and at the same time were general enough to
extend beyond them.
7. Plan of thesis
This dissertation consists of four autonomous essays, each dedicated to a different
aspect of the epistemic and methodological challenges mentioned above. Rather than
advance a single argument, each essay contains a self-standing argument in favour of the
model-based account from different but interlocking perspectives.
Chapter 1 is dedicated to primary measurement standards, and debunks the myth
according to which such standards are perfectly accurate. I clarify how the uncertainty
associated with primary standards is evaluated and how the subsumption of standards under
idealized models justifies inferences from uncertainty to accuracy.
Chapter 2 introduces the problem of quantity individuation, and shows that this
problem cannot be solved independently of the problems of coordination and accuracy. The
model-based account is then presented and shown to dissolve all three problems at once.
Chapter 3 expands on the problem of coordination through a discussion of the
construction and maintenance of Coordinated Universal Time (UTC). As I argue, abstract
quantity concepts such as terrestrial time are not coordinated directly to any concrete clock,
but only indirectly through a hierarchy of models. This mediation explains how seemingly ad
hoc error corrections can stabilize the way an abstract quantity concept is applied to
particulars.
Finally, Chapter 4 extends the scope of discussion from standards to measurement
procedures in general by focusing on calibration. I show that calibration is a special sort of
modeling activity, and that measurement uncertainty is a special sort of predictive
uncertainty associated with this activity. The role of standards in calibration is clarified and a
general solution to the problem of accuracy is provided in terms of a robustness test among
predictions of multiple models.
The Epistemology of Measurement: A Model-Based Account
1. How Accurate is the Standard Second?
Abstract: Contrary to the claim that measurement standards are absolutely accurate by definition, I argue that unit definitions do not completely fix the referents of unit terms. Instead, idealized models play a crucial semantic role in coordinating the theoretical definition of a unit with its multiple concrete realizations. The accuracy of realizations is evaluated by comparing them to each other in light of their respective models. The epistemic credentials of this method are examined and illustrated through an analysis of the contemporary standardization of time. I distinguish among five senses of ‘measurement accuracy’ and clarify how idealizations enable the assessment of accuracy in each sense.13
1.1. Introduction
A common philosophical myth states that the meter bar in Paris is exactly one meter
long – that is, if any determinate length can be ascribed to it in the metric system. One
variant of the myth comes from Wittgenstein, who tells us that the meter bar is the one
thing “of which one can say neither that it is one metre long, nor that it is not one metre
long” (1953 §50). Kripke famously disagrees, but develops a variant of the same myth by
13 This chapter was published with minor modifications as Tal (2011).
stating that the length of the bar at a specified time is rigidly designated by the phrase ‘one
meter’ (1980, 56). Neither of these pronouncements is easily reconciled with the 1960
declaration of the General Conference on Weights and Measures, according to which “the
international Prototype does not define the metre with an accuracy adequate for the present
needs of metrology” and is therefore replaced by an atomic standard (CGPM 1961). There
is, of course, nothing problematic with replacing one definition with another. But how can
the accuracy of the meter bar be evaluated against anything other than itself, let alone be
found lacking?
Wittgenstein and Kripke almost certainly did not subscribe to the myth they helped
disseminate. There are good reasons to believe that their examples were meant merely as
hypothetical illustrations of their views on meaning and reference.14 This chapter does not
take issue with their accounts of language, but with the myth of the absolute accuracy of
measurement standards, which has remained unchallenged by philosophers of science. The
meter is not the only case where the myth clashes with scientific practice. The second and
the kilogram, which are currently used to define all other units in the International System
(i.e. the ‘metric’ system), are associated with primary standards that undergo routine accuracy
evaluations and are occasionally improved or replaced with more accurate ones. In the case
of the second, for example, the accuracy of primary standards has increased more than a
thousand-fold over the past four decades (Lombardi et al 2007).
This chapter will analyze the methodology of these evaluations, and argue that they
indeed provide estimates of accuracy in the same senses of ‘accuracy’ normally presupposed
14 Wittgenstein mentions the meter bar only in passing as an analogy to color language-games. Kripke carefully notes that the uniqueness of the meter bar’s role in standardizing length is no more than a hypothetical supposition (1980, 55, fn. 20).
by scientific and philosophical discussions of measurement. My main examples will come
from the standardization of time. I will focus on the methods by which time and frequency
standards are evaluated and improved at the US National Institute of Standards and
Technology (NIST). These tasks are carried out by metrologists, experts in highly reliable
measurement. The methods and tools of metrology – a live discipline with its own journals
and controversies – have received little attention from philosophers.15 Recent philosophical
literature on measurement has mostly been concerned either with the metaphysics of
quantity and number (Swoyer 1987, Michell 1994) or with the mathematical structures
underlying measurement scales (Krantz et al. 1971). These ‘abstract’ approaches treat the
topics of uncertainty, accuracy and error as extrinsic to the theory of measurement and as
arising merely from imperfections in its application. Though they do not deny that
measurement operations involve interactions with imperfect instruments in noisy
environments, authors in this tradition analyze measurement operations as if these
imperfections have already been corrected or controlled for.
By contrast, the current study is meant as a step towards a practice-oriented
epistemology of physical measurement. The management of uncertainty and error will be
viewed as intrinsic to measurement and as a precondition for the possibility of gaining
knowledge from the operation of measuring instruments. At the heart of this view lies the
recognition that a theory of measurement cannot be neatly separated into fundamental and
applied parts. The methods employed in practice to correct errors and evaluate uncertainties
crucially influence which answers are given to so-called ‘fundamental’ questions about
15 Notable exceptions are Chang (2004) and Boumans (2007). Metrology has been studied by historians and sociologists of science, e.g. Latour (1987, ch. 6), Schaffer (1992), Galison (2003) and Gooday (2004).
quantity individuation and the appropriateness of measurement scales. This will be argued in
detail in Chapter 2.
In this chapter I will use insights into metrological practices to outline a novel account
of the underexplored relationship between uncertainty and accuracy. Scientists often include
uncertainty estimates in their reports of measurement results, but whether such estimates
warrant claims about the accuracy of results is an epistemological question that philosophers
have overlooked. Based on an analysis of time standardization, I will argue that inferences
from uncertainty to accuracy are justified when a doubly robust fit – among instruments as
well as among idealized models of these instruments – is demonstrated. My account will
shed light on metrologists’ claims that the accuracy of standards is being continually
improved and on the role played by idealized models in these improvements.
1.2. Five notions of measurement accuracy
Accuracy is often ascribed to scientific theories, instruments, models, calculations and
data, although the meaning of the term varies greatly with context. Even within the limited
context of physical measurement the term carries multiple senses. For the sake of the
current discussion I offer a preliminary distinction among five notions of measurement
accuracy. These are intended to capture different senses of the term as it is used by
physicists and engineers as well as by philosophers of science. The five notions are neither
co-extensive nor mutually exclusive but instead partially overlap in their extensions. As I will
argue below, the sort of robustness test metrologists employ to evaluate the uncertainty of
measurement standards provides sufficient evidence for the accuracy of those standards
under all five senses of ‘measurement accuracy’.
1. Metaphysical measurement accuracy: closeness of agreement between a
measured value of a quantity and its true value16
(correlate concept: truth)
2. Epistemic measurement accuracy: closeness of agreement among values
reasonably attributed to a quantity based on its measurement17
(correlate concept: uncertainty)
3. Operational measurement accuracy: closeness of agreement between a
measured value of a quantity and a value of that quantity obtained by
reference to a measurement standard
(correlate concept: standardization)
4. Comparative measurement accuracy: closeness of agreement among
measured values of a quantity obtained by using different measuring systems,
or by varying extraneous conditions in a controlled manner
(correlate concept: reproducibility)
16 cf. “Measurement Accuracy” in the International Vocabulary of Metrology (VIM) (JCGM 2008, 2.13). My own definitions for measurement-related terms are inspired by, but in some cases diverge from, those of the VIM.
17 cf. JCGM 2008, 2.13, Note 3.
5. Pragmatic measurement accuracy (‘accuracy for’): measurement accuracy
(in any of the above four senses) sufficient for meeting the requirements of a
specified application.
Let us briefly clarify each of these five notions. First, the metaphysical notion takes
truth to be the standard of accuracy. For example, a thermometer is metaphysically accurate
if its outcomes are close to true ratios between measured temperature intervals and the
chosen unit interval. If one assumes a traditional understanding of truth as correspondence
with a mind-independent reality, the notion of metaphysical accuracy presupposes some
form of realism about quantities. The argument advanced in this chapter is nevertheless
independent of such realist assumptions, as it neither endorses nor rejects metaphysical
conceptions of measurement accuracy.
Second, a thermometer is epistemically accurate if its design and use warrant the
attribution of a narrow range of temperature values to objects. The dispersion of reasonably
attributed values is called measurement uncertainty and is commonly expressed as a value
range18. Epistemic accuracy should not be confused with precision, which constitutes only one
aspect of epistemic accuracy. Measurement precision is the closeness of agreement among
measured values obtained by repeated measurements of the same (or relevantly similar)
18 cf. “Measurement Uncertainty” (JCGM 2008, 2.26.) Note that this term does not refer to a degree of confidence or belief but to a dispersion of values whose attribution to a quantity reasonably satisfies a specified degree of confidence or belief.
objects using the same measuring system19. Imprecision is therefore caused by uncontrolled
variations to the equipment, operation or environment when measurements are repeated.
This sort of variation is a ubiquitous but not exclusive source of measurement uncertainty.
As will be explained below, some measurement uncertainty stems from other sources,
including imperfect corrections to systematic errors. The notion of epistemic accuracy is
therefore broader than that of precision.
Third, operational measurement accuracy is determined relative to an established
measurement standard. For example, a thermometer is operationally accurate if its outcomes are
close to those of a standard thermometer when the two measure relevantly similar samples.
The most common way of evaluating operational accuracy is by calibration, i.e. by modeling
an instrument in a manner that establishes a relation between its indications and standard
quantity values20.
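The calibration relation just described can be given a minimal computational sketch: fit a relation between an instrument's indications and standard quantity values, then use that relation to correct raw readings. The data points and the assumed linear form below are purely illustrative, not drawn from the text:

```python
# A minimal sketch of calibration: establish a relation between an instrument's
# indications and standard quantity values, then use it to correct raw readings.
# The data points and the assumed linear form are purely illustrative.
indications = [10.2, 20.1, 30.4, 40.3]  # raw instrument readings
standards = [10.0, 20.0, 30.0, 40.0]    # corresponding standard values

n = len(indications)
mx = sum(indications) / n
my = sum(standards) / n
# Ordinary least-squares slope and intercept.
slope = (sum((x - mx) * (y - my) for x, y in zip(indications, standards))
         / sum((x - mx) ** 2 for x in indications))
intercept = my - slope * mx

def corrected(reading):
    """Map a raw indication to a calibrated quantity value."""
    return slope * reading + intercept

print(round(corrected(25.0), 2))
```

Real calibrations are, of course, far richer: they model the instrument's physics and propagate the uncertainty of the standard, not just a fitted line.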
Fourth, comparative accuracy is the closeness of agreement among measurement
outcomes when the same quantity is measured in different ways. The notion of comparative
accuracy is closely linked with that of reproducibility. To say that a measurement outcome is
comparatively accurate is to say that it is closely reproducible under controlled variations to
measurement conditions and methods.21 For example, thermometers in a given set are
comparatively accurate if their outcomes closely agree with one another when applied to
relevantly similar samples.
19 cf. “Measurement Precision” (JCGM 2008, 2.15) and “Measurement Repeatability” (ibid, 2.21). My concept of precision is narrower than that of the VIM (see also fn. 21.)
20 cf. “Calibration” (JCGM 2008, 2.39)
21 Unlike precision, reproducibility concerns controlled variations to measurement conditions. I deviate slightly from the VIM on this point to reflect general scientific usage of these terms (cf. “Measurement Reproducibility”, JCGM 2008, 2.25).
Finally, pragmatic measurement accuracy is accuracy sufficient for a specific use, such
as a solution to an engineering problem. There are four sub-senses of pragmatic accuracy,
corresponding to the first four senses of measurement accuracy. For example, a
thermometer is pragmatically accurate in an epistemic sense if the overall uncertainty of its
outcomes is low enough to reliably achieve a specified goal, e.g. keeping an engine from
over-heating. Of course, whether or not a measuring system (or a measured value) is
pragmatically accurate depends on its intended use22.
In the physical sciences quantitative expressions of measurement accuracy are typically
cast in epistemic terms, namely in terms of uncertainty. This does not mean that scientific
estimates of accuracy are always and only estimates of epistemic accuracy. What matters to
the classification of accuracy is not the form of its expression, but the kind of evidence on
which estimates of accuracy are based. As I will argue below, metrological evaluations
provide evidence of the right sort for estimating accuracy under all five notions. Before
delving into the argument, the next section will provide some background on the concepts,
methods and problems involved in the standardization of time.
1.3. The multiple realizability of unit definitions
A key distinction in the standardization of physical units is that between definition
and realization. Since 1967 the second has been defined as the duration of exactly
22 Pragmatic accuracy may be understood as a threshold (pass/fail) concept. Alternatively, pragmatic accuracy may be represented continuously, for example as the likelihood of achieving the specified goal. Both analyses of the concept are compatible with the argument presented here.
9,192,631,770 periods of the radiation corresponding to a hyperfine transition of cesium-133
in the ground state (BIPM 2006). This definition pertains to an unperturbed cesium atom at
a temperature of absolute zero. Because the definition is an idealized description of a kind of
atomic system, no actual cesium atom ever satisfies it. Hence a question arises as to how the
reference of ‘second’ is fixed. The traditional philosophical approach would be to propose
some ‘semantic machinery’ through which the definition succeeds in picking out a definite
duration, e.g. a possible-world semantics of counterfactuals. However, this sort of approach
is hard-pressed to explain how metrologists are able to experimentally access the extension
of ‘second’ given that it is physically impossible to instantiate the conditions
specified by the definition. Consequently, it becomes unclear how metrologists are able to
tell whether the actual durations they label ‘second’ satisfy the definition. By contrast, the
approach adopted in this chapter takes the definition to fix a reference only indirectly and
approximately by virtue of its role in guiding the construction of atomic clocks. Rather than
picking out any definite duration on its own, the definition functions as an ideal specification
for a class of atomic clocks. These clocks approximately satisfy – or in the metrological jargon,
‘realize’ – the conditions specified by the definition23. The activities of constructing and
modeling cesium clocks are therefore taken to fulfill a semantic function, i.e. that of
approximately fixing the reference of ‘second’, rather than simply measuring an already
linguistically fixed time interval.
The construction of an accurate primary realization of the second – a ‘meter stick’ of
time – must make highly sophisticated use of theory, apparatus and data analysis in order to
23 The verb ‘realize’ has various meanings in philosophical discussions. Here I follow the metrological use of this term and take it to be synonymous with ‘approximately satisfy’ (pertaining to a definition.)
approximate as much as possible the ideal conditions specified by the definition. But
multiple kinds of physical processes can be constructed that would realize the second, each
departing from the ideal definition in different respects and degrees. In other words,
different clock designs and environments correspond to different ways of de-idealizing the
definition. As of 2009, thirteen atomic clocks around the globe are used as primary
realizations of the second. There are also hundreds of official secondary realizations of the
second, i.e. atomic clocks that are traced to primary realizations. Like any collection of
physical instruments, different realizations of the second disagree with one another, i.e. ‘tick’
at slightly different rates. The definition of the second is thus multiply realizable in the sense
that multiple real durations approximately satisfy the definition, and no method can
completely rid us of the approximations.
That the definition of the second is multiply realizable does not mean that there are
as many ‘seconds’ as there are clocks. What it does mean is that metrologists are faced with
the task of continually evaluating the accuracy of each realization relative to the ideal cesium
transition frequency and correcting its results accordingly. But the ideal frequency is
experimentally inaccessible, and primary standards have no higher standard against which
they can be compared. The challenge, then, is to forge a unified second out of disparately
‘ticking’ clocks. This is an instance of a general problem that I will call the problem of multiple
realizability of unit definitions24. This problem is semantic, epistemological and methodological
all at once. To solve it is to specify experimentally testable satisfaction criteria for the
24 Chang’s (2004, 59) ‘problem of nomic measurement’ is a closely related, though distinct, problem concerning the standardization of instruments. Both problems instantiate the entanglement between claims to coordination, accuracy and quantity individuation mentioned in the introduction to this thesis. This entanglement and its consequences will be discussed in detail in Chapter 2.
idealized definition of ‘second’, a task which is equivalent to that of specifying grounds for
making accuracy claims about cesium atomic clocks, which is in turn equivalent to the task
of specifying a method for reconciling discrepancies among such clocks. The conceptual
distinction among three axes of the problem should not obscure its pragmatic unity, for as
we shall see below, metrologists are able to resolve all three aspects of the problem
simultaneously.
Prima facie, the problem can be solved by arbitrarily choosing one realization as the
ultimate standard. Yet this solution would bind all measurement to the idiosyncrasies of a
specific artifact, thereby causing measurement outcomes to diverge unnecessarily. Imagine
that all clocks were calibrated against the ‘ticks’ of a single apparatus: the instabilities of that
apparatus would cause clocks to run faster or slower relative to each other depending on the
time of their calibration, and the discrepancy would be revealed when these clocks were
compared to each other. A similar scenario has recently unfolded with respect to the
International Prototype of the kilogram, whose mass was discovered to systematically ‘drift’
relative to the masses of its official copies (Girard 1994). Hence a stipulative approach to unit
definitions exacerbates rather than removes the challenge of reconciling discrepancies
among multiple standards.
The latter point is helpful in elucidating the misunderstanding behind the myth of
absolute accuracy. Once it is acknowledged that unit definitions are multiply realizable, it
becomes clear that no single physical object can be used in practice to completely fix the
reference of a unit term. Rather, this reference must be determined by an ongoing
comparison among multiple realizations. Because these comparisons involve some
uncertainty, the references of unit terms remain vague to some extent. Nevertheless, as the
next sections will make clear, comparisons among standards allow metrologists to minimize
this vagueness, thereby providing an optimal solution to the problem of multiple
realizability.
1.4. Uncertainty and de-idealization
The clock design currently implemented in most primary realizations of the second is
known as the cesium ‘fountain’, so called because cesium atoms are tossed up in a vacuum
cylinder and fall back due to gravity. The best cesium fountains are said to measure the
relevant cesium transition frequency with a fractional uncertainty25 of less than 5 parts in 10¹⁶
(Jefferts et al. 2007). It is worthwhile to examine how this number is determined. To start
off, it is tempting to interpret this number naively as the standard deviation of clock
outcomes from the ideally defined duration of the second. However, because the definition
pertains to atoms in physically unattainable conditions, the aforementioned uncertainty
could not have been evaluated by direct reference to the ideal second. Nor is this number
the standard deviation of a sample of readings taken from multiple cesium fountain clocks.
If a purely statistical approach of this sort were taken, metrologists would have little insight
into the causes of the distribution and would be unable to tell which clocks ‘tick’ closer to
the defined frequency.
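For a sense of scale, the cited fractional uncertainty can be converted into an absolute frequency uncertainty. This back-of-the-envelope sketch uses the defined cesium frequency and the bound cited above; nothing else is assumed:

```python
# The cesium hyperfine frequency that defines the second (exact by definition).
f_cs = 9_192_631_770  # Hz

# The cited bound on fractional uncertainty for the best cesium fountains.
fractional_u = 5e-16

# Fractional uncertainty is the ratio of uncertainty to the best estimate of
# the value being measured, so the absolute uncertainty is their product.
absolute_u = fractional_u * f_cs
print(f"{absolute_u:.2e} Hz")  # on the order of a few microhertz
```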
The accepted metrological solution to these difficulties is to de-idealize the
theoretical definition of the second in discrete ‘steps’, and estimate the uncertainty that each
‘step’ contributes to the outcomes of a given clock. The uncertainty associated with a
25 ‘Fractional’ or ‘relative’ uncertainty is the ratio between measurement uncertainty and the best estimate of the value being measured (usually the mean.)
specific primary frequency standard is then taken to be the total uncertainty contributed to
its outcomes by a sufficient de-idealization of the definition of the second as it applies to that particular
clock. The rest of the present section will describe how this uncertainty is evaluated, and the
next section will describe what kind of evidence is taken to establish the ‘sufficiency’ of de-
idealization.
Two kinds of de-idealization of the definition are involved in evaluating the
uncertainty of frequency standards. These correspond to two different methods of
evaluating measurement uncertainty that metrologists label ‘type-A’ and ‘type-B’26. First, the
definition of the second is idealized in the sense that it presupposes that the relevant
frequency of cesium is a single-valued number. By contrast, the frequency of any real
oscillation converges to a single value only if averaged over an infinite duration, due to so-
called ‘random’ fluctuations. De-idealizing the definition in this respect means specifying a
set of finite run times, and evaluating the width of the distribution of frequencies for each
run time. Uncertainties evaluated in this manner, i.e. by the statistical analysis of a series of
observations, fall under ‘type-A’.
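The type-A procedure just described can be mimicked in a simplified simulation. Clock metrologists in fact use Allan variance statistics for this purpose; the sketch below substitutes a plain standard deviation over simulated white frequency noise, merely to illustrate how the width of the distribution of averaged frequencies shrinks as the run time grows:

```python
import math
import random

# Simulated white frequency noise standing in for raw fractional frequency
# readings; real type-A evaluations use measured data and Allan statistics.
random.seed(0)
readings = [random.gauss(0.0, 1.0) for _ in range(10_000)]

for n in (10, 100, 1000):  # samples per run, i.e. increasing run time
    runs = [sum(readings[i:i + n]) / n for i in range(0, len(readings), n)]
    mean = sum(runs) / len(runs)
    width = math.sqrt(sum((r - mean) ** 2 for r in runs) / (len(runs) - 1))
    print(n, round(width, 3))  # width falls roughly as 1/sqrt(n) for white noise
```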
The second kind of de-idealization of the definition has to do with systematic effects.
For example, one way in which the definition of the second is idealized is that it presupposes
that the cesium atom resides in a completely flat spacetime, i.e. a gravitational potential of
zero. General relativity predicts that, when measured in real conditions on earth, the cesium
frequency will be red-shifted by an amount depending on the altitude of the laboratory housing the
clock. The magnitude of this ‘bias’ is calculated based on a theoretical model of the earth’s
26 See JCGM (2008a) for a comprehensive discussion. The distinction between type-A and type-B uncertainty is unrelated to that of type I vs. type II error. Nor should it be confused with the distinction between random and systematic error.
gravitational field and an altitude measurement27. The measurement of altitude itself involves
some uncertainty, which propagates to the estimate of the shift and therefore to the
corrected outcomes of the clock. This sort of uncertainty, i.e. uncertainty associated with
corrections to systematic errors, falls under ‘type-B’.
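As a rough illustration of the correction involved, the weak-field approximation gives a fractional gravitational frequency shift of g·h/c² for a clock at height h above the geoid. The altitude below is hypothetical, and real corrections rest on a detailed geoid model (see footnote 27), so this is only a sketch of the order of magnitude:

```python
# Illustrative gravitational red-shift correction for a clock at altitude h,
# using the weak-field approximation Δf/f ≈ g·h/c².
g = 9.80665        # standard gravity, m/s^2
c = 299_792_458.0  # speed of light, m/s
h = 1_000.0        # hypothetical clock altitude above the geoid, m

fractional_shift = g * h / c**2
print(f"fractional frequency shift: {fractional_shift:.3e}")  # ~1.09e-13 for 1 km
```

Since this shift is comparable to, or larger than, the uncertainties of the best fountains, the correction and its own uncertainty are unavoidable parts of the budget.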
In addition to gravitational effects, numerous other effects must be estimated and
corrected for a cesium fountain. With every such de-idealization and correction, some type-
B uncertainty is added to the final outcome, i.e. to the number of ‘ticks’ the clock is said to
have generated in a given period. The overall type-B uncertainty associated with the clock is
then taken to be equal to the root sum of squares of these individual uncertainties28. In other
words, the type-B uncertainty of a primary standard is determined by the accumulated
uncertainty associated with corrections applied to its readings. The general method of evaluating
the overall accuracy of measuring systems in this way is known as uncertainty budgeting.
Metrologists draw up tables with the contribution of each correction and a ‘bottom line’ that
expresses the total type-B uncertainty (an example will be given in Section 1.6). Such tables
make explicit the fact that ‘raw ticks’ generated by a clock are by themselves insufficient to
determine the uncertainty associated with that clock. Uncertainties crucially depend not only
on the apparatus, but also on how the apparatus is modeled, and on the level of detail with
which such models capture the idiosyncrasies of a particular apparatus.
27 The calculation of this shift involves the postulation of an imaginary rotating sphere of equal local gravitational potential called a geoid, which roughly corresponds to the earth’s sea level. Normalization to the geoid is intended to transform the proper time of each clock to the coordinate time on the geoid. See for example Jefferts et al (2002, 328).
28 This method of adding uncertainties is only allowed when it is safe to assume that uncertainties are uncorrelated.
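The uncertainty-budgeting arithmetic described above can be sketched as follows. The budget entries and their magnitudes are invented for illustration and are not taken from any published table:

```python
import math

# Hypothetical uncertainty budget for a frequency standard: each entry is the
# type-B uncertainty contributed by one correction, in parts in 10^16.
budget = {
    "gravitational red shift": 0.3,
    "second-order Zeeman shift": 1.0,
    "blackbody radiation shift": 2.6,
    "spin-exchange shift": 1.2,
}

# The 'bottom line': root sum of squares of the individual contributions.
# As footnote 28 notes, this is valid only for uncorrelated uncertainties.
total_type_b = math.sqrt(sum(u ** 2 for u in budget.values()))
print(f"combined type-B uncertainty: {total_type_b:.2f} parts in 10^16")
```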
1.5. A robustness condition for accuracy
We saw that metrologists successively de-idealize the definition of the second until it
describes the specific apparatus at hand. The type-A and type-B uncertainties accumulated in
this process are combined to produce an overall uncertainty estimate for a given clock. This
is how, for example, metrologists arrived at the estimate of fractional frequency uncertainty
cited in the previous section.
A question nevertheless remains as to how metrologists determine the point at which
de-idealization is ‘sufficient’. After all, a complete de-idealization of any physical system is itself
an unattainable ideal. Indeed, the most difficult challenges that metrologists face involve
building confidence in descriptions of their apparatus. Such confidence is achieved by
pursuing two interlocking lines of inquiry: on the one hand, metrologists work to increase
the level of detail with which they model clocks. On the other hand, clocks are continually
compared to each other in light of their most recent theoretical and statistical models. The
uncertainty budget associated with a standard is then considered sufficiently detailed if and
only if these two lines of inquiry yield consistent results. The upshot of this method is that
the uncertainty ascribed to a standard clock is deemed adequate if and only if the outcomes of
that clock converge to those of other clocks within the uncertainties ascribed to each clock by appropriate
models, where appropriateness is determined by the best currently available theoretical
knowledge and data-analysis methods. This kind of convergence is routinely tested for all
active cesium fountains (Parker et al 2001, Li et al 2004, Gerginov 2010) as well as for
candidate future standards, as will be shown below.
The requirement for convergence under appropriate models embeds a double
robustness condition, which may be generalized in the following way:
(RC) Given multiple, sufficiently diverse realizations of the same unit, the
uncertainties ascribed to these realizations are adequate if and only if
(i) discrepancies among realizations fall within their ascribed
uncertainties; and
(ii) the ascribed uncertainties are derived from appropriate models of
each realization.
These two conditions loosely correspond to what Woodward (2006) calls
‘measurement robustness’ and ‘derivational robustness’. The first kind of robustness
concerns the stability of a measured value under varying measurement procedures, while the
second concerns the stability of a prediction under varying modeling assumptions. Note,
however, that in the present case we are not dealing with two independently satisfiable
conditions, but with two sides of a single, composite robustness condition. Recall that the
discrepancies mentioned in sub-condition (i) already incorporate corrections to the quantity
being compared, corrections that were calculated in light of detailed models of the relevant
apparatuses. Conversely, the ‘appropriateness’ of models in (ii) is considered sufficiently
established only once it is shown that these models correctly predict the range of
discrepancies among realizations.
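Sub-condition (i) can be given a schematic computational form: pairwise discrepancies between realizations are checked against their combined ascribed uncertainties. The offsets, uncertainties, and the coverage factor k below are all illustrative assumptions, not part of the condition as stated:

```python
import itertools
import math

# Hypothetical fractional frequency offsets and ascribed uncertainties
# (both in parts in 10^16) for three realizations of the second.
realizations = {
    "clock A": (0.0, 4.0),
    "clock B": (3.1, 3.5),
    "clock C": (-2.0, 5.0),
}

def mutually_consistent(clocks, k=2.0):
    """Check every pair of realizations; k is a coverage factor (an assumption)."""
    for (_, (xa, ua)), (_, (xb, ub)) in itertools.combinations(clocks.items(), 2):
        if abs(xa - xb) > k * math.hypot(ua, ub):
            return False
    return True

print(mutually_consistent(realizations))  # prints True for these hypothetical values
```

Sub-condition (ii) has no comparably mechanical test: the appropriateness of the models from which the uncertainties derive is a matter of theoretical judgment, which is precisely the point of the composite condition.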
Metrology teaches us that (RC) is indeed satisfied in many cases, sometimes with
stunningly small uncertainties. However, the question remains as to why one should take
uncertainties that satisfy this condition to be measures of the accuracy of standards. This
question can be answered by considering each of the five variants of accuracy outlined
above.
To start with the most straightforward case, the comparative accuracy of realizations
is simply the closeness of agreement among them, e.g. the relative closeness of the
frequencies of different cesium fountains. Clearly, uncertainties that fulfill sub-condition (i)
are (inverse) estimates of accuracy in this sense.
Second, from an operational point of view, the accuracy of a standard is the
closeness of its agreement to other standards of the same quantity. This is again explicitly
guaranteed by the fulfillment of (RC) under sub-condition (i). That sub-condition (i)
guarantees two types of accuracy is hardly surprising, since in the special case of
comparisons among standards the notions of comparative and operational accuracy are
coextensive.
Third, the epistemic conception of accuracy identifies the accuracy of a standard
with the narrowness of spread of values reasonably attributed to the quantity realized by that
standard. The evaluation of type-A and type-B uncertainties in light of current theories,
models and data-analysis tools is plausibly the most rigorous way of estimating the range of
durations that reasonably satisfy the definition of ‘second’. The appropriateness requirement
in sub-condition (ii) guarantees that uncertainties are evaluated in this way whenever
possible.
Fourth, according to the metaphysical conception of accuracy, the accuracy of a
standard is the degree of closeness between the estimated and true values of the realized
quantity. Here one may adopt a skeptical position and claim that the true values of physical
quantities are generally unknowable. The skeptic is in principle correct: it may be the case
that despite their diversity, all the measurement standards that metrologists have compared
are plagued by a common systematic effect that equally influences the realized quantity and
thus remains undetected. But for a non-skeptical realist who believes (for whichever reason)
that current theories are true or approximately true, condition (RC) provides a powerful test
for metaphysical accuracy because it relies on the successive de-idealization of the theoretical
definition of the relevant unit. Estimating the metaphysical accuracy of a cesium clock, for
example, amounts to determining the conceptual ‘distance’ of that clock from the ideal
conditions specified by the definition of the second. As mentioned, the uncertainties that go
into (RC) are consequences of precisely those respects in which the realization of a unit falls
short of the definition. It is therefore plausible to consider cross-checked uncertainty
budgets of multiple primary standards as supplying good estimates of metaphysical accuracy.
Nevertheless, it is important to note that condition (RC) and the method of uncertainty
budgeting do not presuppose anything about the truth of our current theories or the reality of
quantities. That is, (RC) is compatible with a non-skeptical realist notion of accuracy without
requiring commitment to its underlying metaphysics.
Finally, from a pragmatic point of view the accuracy of a standard is its capacity to
meet the accuracy needs of a certain application. Here the notion of ‘accuracy needs’ is
cashed out in terms of one of the first four notions of accuracy. As the uncertainties
vindicated by (RC) have already been shown to be adequate estimates of accuracy under the
first four notions, they are ipso facto adequate for the estimation of pragmatic accuracy.
1.6. Future definitions of the second
The methodological requirement to maximize robustness to the limit of what is
practically possible is one of the main reasons why unit definitions are not chosen arbitrarily.
If the definitions of units were determined arbitrarily, their replacement would be arbitrary
as well. But, as metrologists know only too well, changes to unit definitions involve a
complex web of theoretical, technological and economic considerations. Before the
metrological community accepts a new definition, it must be convinced that the relevant unit
can be realized more accurately with the new definition than with the old one. Here again ‘accuracy’ is
cashed out in terms of robustness. In the case of the second, for example, a new generation of
‘optical’ atomic clocks is already claimed to have achieved “an accuracy that exceeds current
realizations of the SI unit of time” (Rosenband et al. 2008, 1809). To demonstrate accuracy
that surpasses the current cesium standard, optical clocks are compared to each other in light
of their most detailed models available. Table 1.1 presents a comparative uncertainty budget
for aluminum and mercury optical clocks recently evaluated at NIST. The theoretical
description of each atomic system is de-idealized successively, and the uncertainties
contributed by each component add up to the ‘bottom line’ type-B uncertainty for each
clock. These uncertainties are roughly an order of magnitude lower than those ascribed to
cesium fountain clocks.
Table 1.1: Comparison of uncertainty budgets of aluminum (Al) and mercury (Hg) optical atomic clocks. This table was used to support the claim that both clocks are more accurate than the current cesium standard. ∆ν stands for fractional frequency bias and σ stands for uncertainty, both expressed in units of 10⁻¹⁸. (source: Rosenband et al 2008, 1809. Reprinted with permission from AAAS)
The experimenters showed that successive comparisons of the frequencies of these
clocks indeed yield outcomes that fall within the ascribed bounds of uncertainty, thereby
applying the robustness condition above. The fact that these clocks involve two different
kinds of atoms was taken to strengthen the robustness of the results. Nevertheless, it is
unlikely that the second will be redefined in the near future in terms of an optical transition.
More optical clocks must be built and compared before metrologists are convinced that such
clocks are modeled with sufficient detail. Meanwhile the accuracy of current cesium
standards is still being improved by employing new methods of controlling and correcting
for errors. In the long run, however, increasing technological challenges involved in
improving the accuracy of cesium fountains are expected to lead to the adoption of new
sorts of primary realizations of the second such as optical clocks.
1.7. Implications and conclusions
As the foregoing discussion has made clear, measurement standards are not
absolutely accurate, nor are they chosen arbitrarily. Moreover, unit definitions do not
completely fix the reference of unit terms, unless ‘fixing’ is understood in a manner that is
utterly divorced from practice. Instead, choices of unit definition, as well as choices of
realization for a given unit definition, are informed by intricate considerations from theory,
technology and data analysis.
The study of these considerations reveals the ongoing nature of standardization
projects. Theoretically, quantities such as mass, length and time are represented by real
numbers on continuous scales. The mathematical treatment of these quantities is indifferent
to the accuracy with which they are measured. But in practice, we saw that the procedures
required to measure duration in seconds change with the degree of accuracy demanded.
Consequently, a necessary condition for the possibility of increasing measurement accuracy
is that unit-concepts are continually re-coordinated with new measuring procedures29. Metrologists are
responsible for performing such acts of re-coordination in the most seamless manner
possible, so that for all practical purposes the second, meter and kilogram appear to remain
unchanged. This is achieved by constructing and improving primary and secondary
realizations, and (less frequently) by redefinition. The dynamic coordination of quantity
concepts with increasingly robust networks of instruments allows measurement results to
retain their validity even when standards are improved or replaced. Moreover, increasing
29 See van Fraassen’s discussion of the problem of coordination (2008, ch.5). I take my own robustness condition (RC) to be a methodological explication of van Fraassen’s ‘coherence constraint’ on acceptable solutions to this problem.
robustness minimizes vagueness surrounding the reference of unit terms, thereby providing
an optimal solution to the problem of multiple realizability of unit definitions.
2. Systematic Error and the Problem of Quantity Individuation
Abstract: When discrepancies are discovered between outcomes of different measuring instruments two sorts of explanation are open to scientists. Either (i) some of the outcomes are inaccurate or (ii) the instruments measure different quantities. Here I argue that, due to the possibility of systematic error, the choice between (i) and (ii) is in principle underdetermined by the evidence. This poses a problem for several contemporary philosophical accounts of measurement, which attempt to analyze ‘foundational’ concepts like quantity independently of ‘applied’ concepts like error. I propose an alternative, model-based account of measurement that challenges the distinction between foundations and application, and show that this account dissolves the problem of quantity individuation.
2.1. Introduction
Physical quantities – the speed of light, the melting point of gold, the earth’s diameter
– can often be measured in more than one way. Instruments that measure a given quantity
may differ markedly in the physical principles they utilize, and it is difficult to imagine
scientific inquiry proceeding were this not the case. The possibility of measuring the same
quantity in different ways is crucial to the detection of experimental errors and the
development of general scientific theories. An important question for any epistemology of
measurement is therefore: ‘how are scientists able to know whether or not different
instruments measure the same quantity?’
However straightforward this question may seem, an adequate account of quantity
individuation across measurement procedures has so far eluded philosophical accounts of
measurement. Contemporary measurement theories either completely neglect this question
or provide overly simplistic answers. As this chapter will show, the question of quantity
individuation is of central concern to theories of measurement. Not only is the question
more difficult than previously thought, but when properly appreciated the challenge posed
by this question undermines a widespread presupposition in contemporary philosophy of
measurement. This presupposition will be referred to here as conceptual foundationalism.
Prevalent in the titles of key works such as Ellis’ Basic Concepts of Measurement (1966) and
Krantz et al’s Foundations of Measurement (1971), conceptual foundationalism is the thesis that
measurement concepts are rigidly divided into ‘fundamental’ and ‘applied’ types, the former
but not the latter being the legitimate domain of philosophical analysis. Fundamental
measurement concepts – particularly, the notions of quantity and scale – are supposed to
have universal criteria of existence and identity. Such criteria apply to any measurement
regardless of its specific features. For example, whether or not two procedures measure the
same quantity is determined by applying a universal criterion of quantity identity to their
results, regardless of which quantity they happen to measure or how accurately they happen
to measure it. By contrast, ‘applied’ concepts like accuracy and error are seen as experimental
in nature. Discussion of the ‘applied’ portion of measurement theory is accordingly left to
laboratory manuals or other forms of discipline-specific technical literature.
As I will argue in this chapter, conceptual foundationalist approaches do not, and
cannot, provide an adequate analysis of the notion of measurable quantity. This is because
the epistemic individuation of measurable quantities essentially depends on considerations of
error distribution across measurement procedures. Questions of the form ‘what quantity
does procedure P measure?’ cannot be answered independently of questions about the
accuracy of P. Deep conceptual and epistemic links tie together the so-called ‘fundamental’
and ‘applied’ parts of measurement theory and prevent identity criteria from being specified
for measurable quantities independently of the specific circumstances of their measurement.
The main reason that these links have been ignored thus far is a misunderstanding of
the notion of measurement error, and particularly systematic measurement error. The
possibility of systematic error – if it is acknowledged at all in philosophical discussions of
measurement – is usually brought up merely to clarify its irrelevance to the discussion.30 The
next section of this chapter will therefore be dedicated to an explication of the idea of
systematic error and its relation to theoretical and statistical assumptions about the specific
measurement process. These insights will be used to generate a challenge for the conceptual
foundationalist that I will call ‘the problem of quantity individuation.’ The following section
will discuss the ramifications of this problem for several conceptual foundationalist theories
of measurement, including the Representational Theory of Measurement (Krantz et al 1971.)
Finally, Section 2.4 will present an alternative, non-foundationalist account of quantity
individuation. I will argue that claims to quantity individuation are adequately tested by
establishing coherence and consistency among models of different measuring instruments.
The account will serve to elucidate the model-based approach to measurement and to
demonstrate its ability to avoid conceptual problems associated with foundationalism.
Moreover, the model-based approach will provide a novel understanding of the epistemic functions of systematic errors. Instead of being conceived merely as obstacles to the reliability of experiments, systematic errors will be shown to constitute indispensable tools for unifying quantity concepts in the face of seemingly inconsistent evidence.

30 See for example Campbell (1920, pp. 471-3).
2.2. The problem of quantity individuation
2.2.1. Agreement and error
How can one tell whether two different instruments measure the same quantity? This
question poses a fundamental challenge to the epistemology of measurement. For any
attempt to test whether two instruments measure the same quantity, either by direct
comparison or by reference to other instruments, involves testing for agreement among
measurement outcomes; but any test of agreement among measurement outcomes must
already presuppose that those outcomes pertain to the same quantity.
To clarify the difficulty, let us first consider the sort of evidence required to establish
agreement among the outcomes of different measurements. We can imagine two
instruments that are intended to measure the same quantity, such as two thermometers. For
the sake of simplicity, we may assume that the instruments operate on a common set of
measured samples. Now suppose that we are asked to devise a test that would determine
whether the two instruments agree in their outcomes when they are applied to samples in
the set.
Naively, one may propose that the instruments agree if and only if their indications exactly coincide when presented with the same samples under the same conditions. But
variations in operational or environmental conditions cause indications to diverge between
successive measurements, and this should not count as evidence against the claim that the
outcomes of the two instruments are compatible.
A more sophisticated proposal would be to repeat the comparison between the
instruments several times under controlled conditions and to use one of any number of
statistical tests to determine whether the difference in readings is consonant with the
hypothesis of agreement between the instruments. This procedure would determine whether
the readings of the two instruments coincide within type-A (or ‘statistical’) uncertainty, the
component of measurement uncertainty typically associated with random error.31
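Such a type-A comparison can be sketched as follows. This is a minimal illustration with hypothetical readings, not a substitute for a proper statistical test (e.g. Student’s t); it simply checks whether two means coincide within their combined standard uncertainties, using a coverage factor k:

```python
import statistics

def type_a_agreement(readings1, readings2, k=2.0):
    """Check whether two sets of repeated readings coincide within
    type-A (statistical) uncertainty, using coverage factor k."""
    m1, m2 = statistics.mean(readings1), statistics.mean(readings2)
    # Standard uncertainty of each mean: sample stdev / sqrt(n)
    u1 = statistics.stdev(readings1) / len(readings1) ** 0.5
    u2 = statistics.stdev(readings2) / len(readings2) ** 0.5
    combined = (u1 ** 2 + u2 ** 2) ** 0.5
    return abs(m1 - m2) <= k * combined

# Hypothetical repeated readings from two thermometers (°C):
t1 = [49.98, 50.02, 50.01, 49.99, 50.00]
t2 = [50.00, 50.02, 49.99, 50.01, 50.03]
print(type_a_agreement(t1, t2))  # -> True
```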
However, due to the possibility of systematic error, coincidence within type-A
uncertainty is neither a necessary nor sufficient criterion for agreement among measuring
instruments, regardless of which statistical test is used. Mathematically speaking, an error is
‘systematic’ if its expected value after many repeated measurements is nonzero.32 In most
cases, the existence of such errors cannot be inferred from the distribution of repeated
readings but must involve some external standard of accuracy.33 Once systematic errors are
corrected, seemingly disparate readings may turn out to stand for compatible outcomes
while apparently convergent readings can prove to mask disagreement. Consequently, it is
impossible to adjudicate questions concerning agreement among measuring instruments
before systematic errors have been corrected.
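The point that a nonzero expected error cannot be removed by averaging can be illustrated with a toy simulation (all numbers hypothetical): the random component shrinks as readings accumulate, but the mean error converges on the fixed offset rather than on zero.

```python
import random

random.seed(7)

def measure(true_value, bias=0.3, noise=0.1):
    """One reading: true value + fixed systematic offset + random noise."""
    return true_value + bias + random.gauss(0.0, noise)

readings = [measure(20.0) for _ in range(10000)]
mean_error = sum(r - 20.0 for r in readings) / len(readings)
# Averaging suppresses random error, but the expected error stays ~0.3:
print(round(mean_error, 1))  # -> 0.3
```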
31 My terminology follows the official vocabulary of the International Bureau of Weights and Measures as published by the Joint Committee for Guides in Metrology. For definitions and discussion of type-A and type-B uncertainties see JCGM (2008, 2008a). Note that the distinction between type-A and type-B uncertainty is unrelated to that of type I vs. type II error. Nor should it be confused with the distinction between random and systematic error.
32 cf. JCGM (2008, 2.17).
33 Some systematic errors can be evaluated purely statistically, such as random walk noise (a.k.a. Brownian noise) in the frequency of electric signals.
A well-known example34 concerns glass containers filled with different thermometric
fluids – e.g. mercury, alcohol and air. If one examines the volume indications of these
thermometers when applied to various samples, one discovers that temperature intervals that
are deemed equal by one instrument are deemed unequal by the others. These discrepancies
are stable over many trials and therefore not eliminable through statistical analysis of
repeated measurements. Moreover, because the ratio between corresponding volume
intervals measured by different thermometers is not constant, it is impossible to eliminate
the discrepancy by linear scale transformations such as from Celsius to Fahrenheit.
Nevertheless, from the point of view of scientific methodology these thermometers
may still turn out to be in good agreement once an appropriate nonlinear correction is
applied to their readings. Such numerical correction is often made transparent to users by
manipulating the output of the instrument, e.g. by incorporating the correction into the
gradations on the display. For example, if the thermometers appear to disagree on the
location of the midpoint between the temperatures of freezing and boiling water, the ‘50
Celsius’ marks on their displays may simply be moved so as to restore agreement. Corrective
procedures of this sort are commonplace during calibration and are viewed by scientists as
enhancing the accuracy of measuring instruments. Indeed, in discussions concerning
agreement among measurement outcomes scientists almost never compare ‘raw’, pre-
calibrated indications of instruments directly to each other, a comparison that is thought to
be uninformative and potentially misleading.
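The kind of nonlinear correction described above can be sketched in a few lines. The numbers are hypothetical; a real calibration would fit a curve to many comparison points and propagate the associated uncertainties:

```python
def make_correction(raw, standard):
    """Quadratic correction curve through three paired calibration
    points (raw indication -> value on the standard scale), built
    by Lagrange interpolation."""
    (x0, x1, x2), (y0, y1, y2) = raw, standard
    def corrected(x):
        return (y0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
                + y1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
                + y2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1)))
    return corrected

# Hypothetical: a thermometer agrees with the air standard at the
# freezing and boiling points but reads 48.5 at the true midpoint.
correct = make_correction((0.0, 48.5, 100.0), (0.0, 50.0, 100.0))
print(correct(48.5))  # -> 50.0  (the '50' mark is effectively moved)
```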
34 See Mach (1966 [1896]), Ellis (1966, 90-110), Chang (2004, Ch. 2) and van Fraassen (2008, 125-30).

What sort of evidence should one look for to decide whether and how much to correct the indications of measuring instruments? Background assumptions about what the instrument is measuring play an important role here. When in 1887 Michelson and Morley
measured the velocity of light beams propagating parallel and perpendicular to the supposed
ether wind they observed little or no significant discrepancy.35 Whether this result stands for
agreement between the two values of velocity nevertheless depends on how one represents the
apparatus and its interaction with its environment. Fitzgerald and Lorentz hypothesized that
the arms of the interferometer contracted in dependence on their orientation relative to the
ether wind, an effect that would result in a systematic error that exactly cancels out the
expected difference of light speeds. According to this representation of what the apparatus
was measuring, the seeming convergence in velocities merely masked disagreement. By
contrast, under Special Relativity length contraction is considered not an extraneous
disturbance to the measurement of the velocity of light but a fundamental consequence of
its invariance, and the results are taken to indicate genuine agreement. Hence an effect that
requires systematic correction under one representation of the apparatus is deemed part of
the correct operation of the apparatus under another.
35 Here the comparison is not between different instruments but different operations that involve the same instrument. I take my argument to apply equally to both cases.

A similar point is illustrated, though under very different theoretical circumstances, by the development of thermometry in the eighteenth and nineteenth centuries. As noted by Chang (2004, Ch. 2), by the mid-1700s it was well known that thermometers employing different sorts of fluids exhibit nonlinear discrepancies. This discovery prompted the rejection of the naive assumption that the volume indications of all thermometers were linearly correlated with temperature. Eventually, comparisons among thermometers (culminating in the work of Henri Regnault in the 1840s) gave rise to the adoption of air thermometers as standards. But the adoption of air as a standard thermometric fluid did not
cause other thermometers, such as mercury thermometers, to be viewed as categorically less
accurate. Instead, the adoption of the air standard led to the recalibration of mercury
thermometers under the assumption that their indications are nonlinearly correlated with
temperature. What matters to the accuracy of mercury thermometers under the new
assumption is no longer their linearity but the predictability of their deviation from air
thermometers. The indications of a mercury thermometer could now deviate from linearity
without any loss of accuracy as long as they were predictably correlated with corresponding
indications of a standard. Once again, what is taken to be an error under one representation
of the apparatus is deemed an accurate result under another.
2.2.2. The model-relativity of systematic error
The examples of thermometry and interferometry highlight an important feature of
systematic error: what counts as a systematic error depends on a set of assumptions
concerning what and how the instrument is measuring. These assumptions serve as a basis
for constructing a model of the measurement process, that is, an abstract quantitative
representation of the instrument’s behavior including its interactions with the sample and
environment. The main function of such models is to allow inferences to be made from
indications (or ‘readings’) of an instrument to values of the quantity being measured. While
various types of models are involved in interpreting measurement results, for the sake of the
current discussion it is sufficient to distinguish between models of the data generated by a
measurement process and theoretical models representing the dynamics of a measurement
process. Both sorts of models involve assumptions about the measuring instrument, the sample and environment, but the kinds of assumptions differ in the two cases.
Models of data (or ‘data models’) are constructed out of assumptions about the
relationship between possible values of the quantity being measured, possible indications of
the instrument, and values of extraneous variables, including time.36 These assumptions are
used to predict a functional relation between the input and output of an instrument known
as the ‘calibration curve’. We already saw the centrality of data models to the detection of
systematic error in the thermometry example. The initial assumption of linear expansion of
fluids provided a rudimentary calibration curve that allowed inferring temperature from
volume. The linear data model nevertheless proved to be of limited accuracy, beyond which
its predictions came into conflict with the assumption that different instruments measure the
same single-valued quantity, temperature.37 Hence a systematic error was detected based on
linear data models and later corrected by constructing more complex data models that
incorporate nonlinearities. In the course of this modeling activity the thermometers in
question are viewed largely as ‘black-boxes’, and very little is assumed about the mechanisms
that cause fluids and gases to expand when heated.38
In the Michelson-Morley example, by contrast, model selection was informed by a
theoretical account of how the apparatus worked. Generally, a theoretical model of a
measuring instrument represents the internal dynamics of the instrument as well as its
interactions with the environment (e.g. ether) and the measured sample (e.g. light beams.)
36 For a general account of models of data see Suppes (1962).
37 See Chang (2004, pp. 89-92).
38 The distinction between data models and theoretical models is closely related to the distinction between ‘black-box’ and ‘white-box’ calibration tests. For detailed discussion of this distinction see Chapter 4, Sections 4.3 and 4.4, as well as Boumans (2006).
The accepted theoretical model of an instrument is crucial in specifying what the instrument is
measuring. The model also determines which behaviors of the instrument count as evidence
for a systematic error. Both of these epistemic functions are clearly illustrated by the
Michelson-Morley example, where the classical model of what the instrument is measuring
(light speed relative to the ether) was replaced with a relativistic model of what the
instrument is measuring (universally constant light speed in any inertial frame.) As part of
this change in the accepted theoretical model of the instrument, the dynamical explanation
of length contraction was replaced with a kinematic explanation. Rather than correct the
effects of length contraction, the new theoretical model of the apparatus conceptually
‘absorbed’ these effects into the value of the quantity being measured.
Despite vast differences between the two examples, they both illustrate the sensitivity
of systematic error to a representational context. That is, in both the thermometry and
interferometry cases the attribution of systematic errors to the indications of the instrument
depends on what the instrument is taken to measure under a given representation.
Furthermore, in both cases the error is corrected (or conceptually ‘eliminated’) merely by
modifying the model of the apparatus and without any physical change to its operation.
The following ‘methodological definition’ of systematic error makes explicit the
model-relativity of the concept:
Systematic error: a discrepancy whose expected value is nonzero between the
anticipated or standard value of a quantity and an estimate of that value based
on a model of a measurement process.
This definition is ‘methodological’ in the sense that it pertains to the method by which
systematic errors are detected and estimated. This way of defining systematic error differs
from ‘metaphysical’ definitions of error, which characterize measurement error in relation to
a quantity’s true value. The methodological definition has the advantage of being
straightforwardly applicable to scientific practice, because in most cases of physical
measurement the exact true value of a quantity is unknowable and thus cannot be used to
estimate the magnitude of errors.
Apart from its applicability to scientific practice, the methodological definition of
systematic error has the advantage of capturing all three sorts of ways in which systematic
errors may be corrected, namely (i) by physically modifying the measurement process – for
example, shielding the apparatus from causes of error; (ii) by modifying the theoretical or
data model of the apparatus or (iii) by modifying the anticipated value of the quantity being
measured. In everyday practice (i) and (ii) are usually used in combination, whereas (iii) is
much rarer and may occur due to a revision to the ‘recommended value’ of a constant or due
to changes in accepted theory.39 I will discuss the first two sorts of correction in detail below.

39 The Michelson-Morley example illustrates a combination of (ii) and (iii), as both the theoretical model of the interferometer and the expected outcome of measurement are modified.

The methodological definition of systematic error is still too broad for the purpose of the current discussion, because it includes errors that can be eliminated simply by changing the scale of measurement, e.g. by modifying the zero point of the indicator or by converting from, say, Celsius to Fahrenheit. By contrast, a subset of systematic errors that I will call ‘genuine’ cannot be eliminated in this fashion:

A genuine systematic error: a systematic error that cannot be eliminated merely by a permissible transformation of measurement scale.
The possibility of genuine systematic error will prove crucial to the individuation of
quantities across different measuring instruments. Unless otherwise mentioned, the term
‘systematic error’ henceforth denotes genuine systematic errors.
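The distinction can be illustrated with a toy sketch (the midpoint reading is hypothetical): an affine rescaling such as Celsius-to-Fahrenheit is a permissible transformation for interval scales, but no affine map can absorb a nonlinear discrepancy.

```python
def affine_through(p, q):
    """The affine map a*x + b fixed by two calibration points (x, y) --
    the form of a permissible rescaling of an interval scale."""
    (x0, y0), (x1, y1) = p, q
    a = (y1 - y0) / (x1 - x0)
    return lambda x: a * x + (y0 - a * x0)

# Celsius -> Fahrenheit: the 'discrepancy' between the scales vanishes
# under an affine transformation, so it is not a genuine error.
f = affine_through((0.0, 32.0), (100.0, 212.0))
print(round(f(25.0), 6))  # -> 77.0

# By contrast, a thermometer that matches a standard at 0 and 100 but
# reads 48.5 at the standard's midpoint exhibits a genuine systematic
# error: the affine map through the endpoints cannot repair it.
g = affine_through((0.0, 0.0), (100.0, 100.0))
print(g(48.5) == 50.0)  # -> False
```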
2.2.3. Establishing agreement: a threefold condition
To recapitulate the trajectory of the discussion so far, the need to test whether
different instruments measure the same quantity has led us to look for a test for agreement
among measurement outcomes. We saw that agreement can only be established once
systematic errors have been corrected, and that what counts as a systematic error depends on
how instruments are modeled. Consequently, any test for agreement among measuring
instruments is itself model-relative. The model-relativity of agreement is a direct
consequence of the fact that a change in modeling assumptions may result in a different
attribution of systematic errors to the indications of instruments. For this reason, the results
of agreement tests between measuring instruments may be modified without any physical
change to the apparatus, merely by adopting different modeling assumptions with respect to
the behavior of instruments.
Agreement is therefore established by the convergence of outcomes under specified models
of measuring instruments. Specifically, detecting agreement requires that:
(R1) the instruments are modeled as measuring the same quantity, e.g. temperature or the velocity of light;40
(R2) the indications of each instrument are corrected for systematic errors in light
of their respective models; and
(R3) the corrected indications converge within the bounds of measurement
uncertainty associated with each instrument.
As I have shown in the previous chapter, these requirements are implemented in practice by measurement experts (or ‘metrologists’) to establish the compatibility of measurement outcomes.41
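Requirements (R2) and (R3) can be rendered schematically as follows. The values are hypothetical and the uncertainty arithmetic is reduced to combination in quadrature; real evaluations follow the far richer framework of the GUM:

```python
def corrected_outcome(indication, correction, u_type_a, u_type_b):
    """(R2): apply the model-based correction to an indication and
    combine type-A and type-B uncertainties in quadrature."""
    return indication + correction, (u_type_a ** 2 + u_type_b ** 2) ** 0.5

def agree(outcome1, outcome2, k=2.0):
    """(R3): corrected outcomes (value, combined uncertainty) converge
    within expanded uncertainty, using coverage factor k."""
    (v1, u1), (v2, u2) = outcome1, outcome2
    return abs(v1 - v2) <= k * (u1 ** 2 + u2 ** 2) ** 0.5

# Hypothetical: two instruments modeled as measuring the same quantity (R1),
# each corrected for its own systematic errors (R2), then compared (R3).
o1 = corrected_outcome(100.12, -0.10, 0.01, 0.02)
o2 = corrected_outcome(99.95, +0.05, 0.02, 0.02)
print(agree(o1, o2))  # -> True
```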
Before we examine the epistemological ramifications of these three requirements, a
clarification is in order with respect to requirement (R3). This is the requirement that
convergence be demonstrated within the bounds of measurement uncertainty. The term
‘measurement uncertainty’ is here taken to refer to the overall uncertainty of a given
measurement, which includes not only type-A uncertainty calculated from the distribution of
repeated readings but also type-B uncertainty, the uncertainty associated with estimates of
the magnitudes of systematic errors42. Theoretical models of the apparatus play an important
role in evaluating type-B uncertainties and consequently in deciding what counts as
appropriate bounds for agreement between instruments. For example, the theoretical model of cesium fountain clocks (the atomic clocks currently used to standardize the unit of time, defined as one second) predicts that the output frequency of the clock will be affected by collisions among cesium atoms. The higher the density of cesium atoms housed by the clock, the larger the systematic error with respect to the quantity being measured, in this case the ideal cesium frequency in the absence of such collisions. To estimate the magnitude of the error, scientists manipulate the density of atoms and then extrapolate their data to the limit of zero density. The estimated magnitude of the error is then used to correct the raw output frequency of the clock, and the uncertainty associated with the extrapolation is added to the overall measurement uncertainty associated with the clock.43 This latter uncertainty is classified as ‘type-B’ because it is derived from a secondary experiment on the apparatus rather than from a statistical analysis of repeated clock readings.

40 This requirement may be specified either in terms of a quantity type (e.g. velocity) or in terms of a quantity token (e.g. velocity of light). Both formulations amount to the same criterion, for in both cases measurement outcomes are expressed in terms of some quantity token, e.g. a velocity of some thing.

41 See Chapter 1, Section 1.5, as well as the VIM definition of “Compatibility of Measurement Results” (JCGM 2008, 2.47).

42 See Chapter 1, Section 1.4, as well as fn. 31 above.
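The zero-density extrapolation just described can be sketched as a simple least-squares fit whose intercept estimates the collision-free frequency. The data points below are hypothetical (the unperturbed cesium frequency is 9 192 631 770 Hz by definition of the second); the actual evaluation, as in Jefferts et al (2002), is far more involved:

```python
def extrapolate_to_zero_density(densities, frequencies):
    """Least-squares line through (density, frequency) pairs;
    the intercept estimates the frequency at zero density."""
    n = len(densities)
    mx = sum(densities) / n
    my = sum(frequencies) / n
    sxx = sum((x - mx) ** 2 for x in densities)
    sxy = sum((x - mx) * (y - my) for x, y in zip(densities, frequencies))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical clock frequencies at three atom densities:
dens = [1.0, 2.0, 3.0]                               # arbitrary units
freq = [9192631770.2, 9192631770.4, 9192631770.6]    # Hz
f0, shift_per_density = extrapolate_to_zero_density(dens, freq)
print(round(f0, 1))  # -> 9192631770.0
```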
The conditions under which requirement (R3) is fulfilled are therefore model-relative
in two ways. First, they depend on a theoretical or data model of the measurement process
to establish what counts as a systematic error and therefore what the corrected readings are.
Second, as just noted, these conditions depend on how type-B uncertainties are evaluated,
which again depends on theoretical and statistical assumptions about the apparatus. As the
first two requirements (R1) and (R2) are already explicitly tied to models, the upshot is that
each of the three requirements that together establish agreement among measuring
instruments is model-relative in some respect.
43 See Jefferts et al (2002) for a detailed discussion of this evaluation method.
2.2.4. Underdetermination
When the threefold condition above is used as a test for agreement among measuring
instruments, the corrected readings may turn out not to converge within the expected bounds
of uncertainty. In such a case disagreement (or incompatibility) is detected between measurement outcomes. There are accordingly three possible sorts of explanation for such
disagreement:
(H1) the instruments are not measuring the same quantity;44
(H2) systematic errors have been inappropriately evaluated; or
(H3) measurement uncertainty has been underestimated.
How does one determine which is the culprit (or culprits)? Prima facie, one should
attempt to test each of these three hypotheses independently. To test (H1) scientists may
attempt to calibrate the instruments in question against other instruments that are thought to
measure the desired quantity. But this sort of calibration is again a test of agreement. For
calibration to succeed, one must already presuppose under requirement (R1) that the calibrated
and calibrating instrument measure the same quantity. The success of calibration therefore
cannot be taken as evidence for this presupposition. Alternatively, if calibration fails
scientists are faced with the very same problem, now multiplied.
44 Cf. fn. 40. As before, it makes no conceptual difference whether this hypothesis is formulated in terms of a quantity type or a quantity token. The choice of formulation does, however, make a practical difference to the strategies scientists are likely to employ to resolve discrepancies. See Section 2.4.3 for discussion.
Scientists may attempt to test (H2) or (H3) independently of (H1). But this is again
impossible, because the attribution of systematic error involved in claim (H2) is model-
relative and can only be tested by making assumptions about what the instruments are
measuring. Similarly, the evaluation of measurement uncertainty involved in testing (H3)
includes type-B evaluations that are relative to a theoretical model of the measurement
process. Moreover, as (H3) applies only to readings that have already been corrected for
systematic error, it cannot be tested independently of (H2) and ipso facto of (H1).
We are therefore confronted with an underdetermination problem. In the face of
disagreement among measurement outcomes, no amount of empirical evidence can alone
determine whether the procedures in question are inaccurate [(H2 or H3) is true] or whether
they are measuring different quantities (H1 is true). Any attempt to settle the issue by
collecting more evidence merely multiplies the same conundrum. I call this the problem of
quantity individuation. Like other cases of Duhemian underdetermination, it is only a problem
if one believes that there is a disciplined way of deciding which hypothesis to accept (or
reject) based on empirical evidence alone. As we shall see immediately below, several
contemporary philosophical theories of measurement indeed subscribe to this mistaken
belief. That is, they assume that questions of the form ‘what does procedure P measure?’ can
be answered decisively based on nothing more than the results of empirical tests, and
independently of any prior assumptions as to the accuracy of P. This belief lies at the heart
of the foundationalist approach to the notion of measurable quantity, a notion that is viewed
as epistemologically prior to the ‘applied’ challenges involved in making concrete
measurements.
A direct upshot of the problem of quantity individuation is that the individuation of
measurable quantities and the distribution of systematic error are but two sides of the same
epistemic coin. Specifically, the possibility of attributing genuine systematic errors to
measurement outcomes (along with relevant type-B uncertainties) is a necessary precondition
for the possibility of establishing the unity of quantities across the various instruments that
measure them. Unless genuine systematic errors are admitted as a possibility when analyzing
experimental results, instruments exhibiting nonscalable discrepancies cannot be taken to
measure the same quantity. Concepts such as temperature and the velocity of light therefore
owe their unity to the possibility of such errors, as do the laws in which such quantities
feature. The notion of measurement error, in other words, has a constructive function in the
elaboration of quantity concepts, a function that has so far remained unnoticed by theories
of measurement.
2.2.5. Conceptual vs. practical consequences
The problem of quantity individuation may strike one as counter-intuitive. Do not
scientists already know that their thermometers measure temperature before they set out to
detect systematic errors? The answer is that scientists often do know, but that their
knowledge is relative to background theoretical assumptions concerning temperature and to
certain traditions of interpreting empirical evidence. Such traditions serve, among other
things, to constrain the range of trade-offs between simplicity and explanatory power that a
scientific community would deem acceptable. Theoretical assumptions and interpretive
traditions inform the choices scientists make among the three hypotheses above. In ‘trivial’
cases of quantity individuation, namely in cases where previous agreement tests have already
been performed among similar instruments under similar conditions with a similar or higher
degree of accuracy, an appeal to background theories and traditions is usually sufficient for
determining which of the three hypotheses will be accepted.
As we shall see below, the foundationalist fallacy is to think of such choices as justified
in an absolute sense, that is, outside of the context of any particular scientific theory or
interpretive tradition. Van Fraassen calls this sort of absolutism the ‘view from nowhere’
(2008, 122) and rightly points out that there can be no way of answering questions of the
form ‘what does procedure P measure?’ independently of some established tradition of
theorizing and experimenting. He distinguishes between two sorts of contexts in which such
questions may be answered: ‘from within’, i.e. given the historically available theories and
instrumental practices at the time, or ‘from above’, i.e. retrospectively in light of
contemporary theories.
Although van Fraassen does not discuss the problem of quantity individuation, his
terminology is useful for distinguishing between two different consequences of this problem.
The first, conceptual consequence has already been mentioned: there can be no theory-free test
of quantity individuation. This consequence is not a problem for practicing scientists but only
for conceptual foundationalist accounts of measurement. It stems from the attempt to
devise a test for quantity individuation that would view measurement ‘from nowhere’, prior to
any theoretical assumptions about what is being measured and regardless of any particular
tradition of interpreting empirical evidence.
The other, practical consequence of the problem of quantity individuation is a challenge
for scientists engaged in ‘nontrivial’ measurement endeavors, ones that involve new kinds of
instruments, novel operating conditions or higher accuracy levels than previously achieved
for a given quantity. Exemplary procedures of calibration and error correction may not yet
exist for such measurements. In the face of incompatible outcomes from novel instruments,
then, researchers may not have at their disposal established methods for restoring
agreement. Nor can they settle the issue based on empirical evidence from comparison tests
alone, for as the problem of quantity individuation teaches us, such evidence is insufficient
for deciding which one (or more) of the three hypotheses above to accept. The practical
challenge is to devise new methods of adjudicating agreement and error ‘from within’, i.e. by
extending existing theoretical presuppositions and interpretive traditions to a new domain.
As we shall see below, multiple strategies are open to scientists confronted with
disagreement among novel measurements.
Historically, the process of extension has almost always been conservative. Scientists
engaged in cutting-edge measurement projects usually start off by dogmatically supposing that
their instruments will measure a given quantity in a new regime of accuracy or operating
conditions. This conservative approach is extremely fruitful as it leads to the discovery of
new systematic errors and to novel attempts to explain such errors. But such dogmatic
supposition should not be confused with empirical knowledge, because novel measurements
may lead to the discovery of new laws and to the postulation of quantities that are different
from those initially supposed. Instead, this sort of dogmatic supposition can be regarded as a
manifestation of a regulative ideal, an ideal that strives to keep the number of quantity
concepts small and underlying theories simple.
Due to their marked differences, I will consider the two consequences of the problem
of quantity individuation as two distinct problems. The next section will discuss the
conceptual problem and its consequences for foundationalist theories of measurement. The
following section will explain how the conceptual problem is dissolved by adopting a model-
based approach to measurement. I will then return to the practical problem – the problem of
deciding which hypotheses to accept in real, context-rich cases of disagreement – at the end
of Section 2.4. Unless otherwise mentioned, the ‘problem of quantity individuation’
henceforth refers to the conceptual problem.
2.3. The shortcomings of foundationalism
The conceptual problem of quantity individuation should not come as a surprise to
philosophers of science. It is, after all, a special case of a well-known problem named after
Duhem45. Nevertheless, a look at contemporary works on the philosophy of measurement
reveals that the problem of quantity individuation has so far remained unrecognized. Worse
still, the consequences of this problem are in conflict with several existing accounts of
physical measurement. This section is dedicated to a discussion of the repercussions of the
conceptual problem of quantity individuation for three philosophical theories of
measurement. A by-product of explicating these repercussions is that the problem itself will
be further clarified.
All three philosophical accounts discussed here are empiricist, in the sense that they
attempt to reduce questions about the individuation of quantities to questions about
relations holding among observable results of empirical procedures. These accounts are also
foundationalist insofar as they take universal criteria pertaining to the configuration of
observable evidence to be sufficient for the individuation of quantities, regardless of
theoretical assumptions about what is being measured or local traditions of interpreting
evidence. Hence for a foundational empiricist the result of an individuation test must not
45 Duhem (1991 [1914], 187)
depend on any background assumption unless that assumption can be tested empirically. As
I will argue, foundational empiricist criteria individuate quantities far too finely, leading to a
fruitless multiplication of natural laws. Such accounts of measurement are unhelpful in
shedding light on the way quantities are individuated by successful scientific theories.
2.3.1. Bridgman’s operationalism
The first account of quantity individuation I will consider is operationalism as
expounded by Bridgman (1927). Bridgman proposes to define quantity concepts in physics
such as length and temperature by the operation of their measurement. This proposal leads
Bridgman to claim that currently accepted quantity concepts have ‘joints’ where different
operations overlap in their value range or object domain. He warns against dogmatic faith in
the unity of quantity concepts across these ‘joints’, urging instead that unity be checked
against experiments. Bridgman nevertheless concedes that it is pragmatically justified to
retain the same name for two quantities if “within our present experimental limits a
numerical difference between the results of the two sorts of operations has not been
detected” (ibid, 16.)
Bridgman can be said to advance two distinct criteria of quantity individuation, the
first substantive and semantic, the other nominal and empirical. The first criterion is a direct
consequence of the operationalist thesis: quantities are individuated by the operations that
define them. Hence a difference in measurement operation is a sufficient condition for a
difference in the quantity being measured. But even if we grant Bridgman the existence of a
clear criterion for individuating operations, the operationalist approach generates an absurd
multiplicity of quantities and laws. Unless ‘operation’ is defined in a question-begging
manner, there is no reason to think that operating a ruler and operating an interferometer
(both used for measuring length) are instances of a single sort of operation. Bridgman, of
course, welcomed the multiplicity of quantity concepts in the spirit of empiricist caution.
Nevertheless, it is doubtful whether the sort of caution Bridgman advised is being served by
his operational analysis of quantity. As long as quantities are defined by operations, no two
operations can measure the same quantity; as a result, it is impossible to distinguish between
results that are ascribable to the objects being measured and those that are ascribable to
some feature of the operation itself, the environment, or the human operator. An
operational analysis, in other words, denies the possibility of testing the objective validity of
measurement claims – their validity as claims about measured objects. This denial stands in
stark contrast to Bridgman’s own cautionary methodological attitude.
Bridgman’s second, empirical criterion of individuation is meant to save physical
theory from conceptual fragmentation. According to the second criterion, quantities are
nominally individuated by the presence of agreement among the results of operations that
measure them. The same ‘nominal quantity’, such as length, is said to be measured by several
different operations as long as no significant discrepancy is detected among the results of
these operations. But this criterion is naive, because different operations that are thought to
measure the same quantity rarely agree with each other before being deliberately corrected
for systematic errors. Such corrections are required, as we have seen, even after one averages
indications over many repeated operations and ‘normalizes’ their scale. The empirical
criterion of individuation therefore fails to avoid the unnecessary multiplicity of quantities.
Alternatively, if by ‘numerical difference’ above Bridgman refers to measurement results that
have already been corrected for systematic errors, such numerical difference can only be
evaluated under the presupposition that the two operations measure the same quantity. This
presupposition is nevertheless the very claim that Bridgman needs to establish. This last
reading of Bridgman’s individuation criterion is therefore circular46.
2.3.2. Ellis’ conventionalism
A second and seemingly more promising candidate for an empiricist criterion of
quantity individuation is provided by Ellis in his Basic Concepts of Measurement (1966). Instead
of defining quantity concepts in terms of particular operations, Ellis views quantity concepts
as ‘cluster concepts’ that may be “identified by any one of a large number of ordering
relationships” (ibid, 35). Different instruments and procedures may therefore measure the
same quantity. What is common to all and only those procedures that measure the same
quantity is that they all produce the same linear order among the objects being measured: “If
two sets of ordering relationships, logically independent of each other, always generate the
same order under the same conditions, then it seems clear that we should suppose that they
are ordering relationships for the same quantity” (ibid, 34).
Ellis’ individuation criterion appears at first to capture the examples examined so far.
The thermometers discussed above preserve the order of samples regardless of the
46 Note that my grounds for criticizing Bridgman differ significantly from the familiar line of criticism expressed by Hempel (1966, 88-100). Hempel rejects the proliferation of operational quantity-concepts insofar as it makes the systematization of scientific knowledge impossible. In this respect I am in full agreement with Hempel. But Hempel fails to see that Bridgman’s nominal criterion of quantity individuation is not only opposed to the systematic aims of science but also blatantly circular. Like Bridgman, Hempel wrongly believed that agreement and disagreement among measuring instruments are adjudicated by a comparison of indications, or ‘readings’ (ibid, 92). The circularity of Bridgman’s criterion is exposed only once the focus shifts from instrument indications to measurement outcomes, which already incorporate error corrections. I will elaborate on this distinction below.
thermometric fluid used: a sample that is deemed warmer than another by one thermometer
will also be deemed warmer by the others. Similarly, two atomic clocks whose frequencies
are unstable relative to each other still preserve the order of events they are used to record,
barring relativistic effects.
Nevertheless, Ellis’ criterion fails to capture the case of intervals and ratios of
measurable quantities. Quantity intervals and quantity ratios are themselves quantities, and
feature prominently in natural laws. Indeed, Ellis himself mentions the measurement of
time-intervals and temperature-intervals47 and treats them as examples of quantities. As we
have seen, when genuine systematic errors occur, measurement procedures do not preserve
the order of intervals and ratios. Two temperature intervals deemed equal by one
thermometer are deemed unequal by another depending on the thermometric fluid used.
Note that this discrepancy persists far above the sensitivity thresholds of the instruments
and cannot be attributed to resolution limitations.
A similar situation occurs with clocks. Consider two clocks, one whose ‘ticks’ are
slowly increasing in frequency relative to a standard, the other slowly decreasing. Now
imagine that each of these clocks is used to measure the frequency of the standard.
Relativistic effects aside, the speeding clock will indicate that the time intervals marked by
the standard are slowly increasing while the slowing clock will indicate that they are
decreasing – a complete reversal of order of time intervals. Ellis’ criterion is therefore
insufficient to decide whether or not the two clocks measure intervals of the same quantity
(i.e. time.) Considered in light of this criterion alone, the clocks may just as well be deemed
to measure intervals of two different and anti-correlated quantities, time-A and time-B. But
47 Ibid, 44 and 100.
this is absurd, and again leaves open the possibility of unnecessary multiplication of
quantities and laws encountered in Bridgman’s case.
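The reversal described here can be checked with a toy calculation (my sketch, not from the text: the linear drift model and the drift rate are illustrative assumptions). Each clock ‘measures’ a standard one-second interval as the number of its own ticks falling within that interval:

```python
# Illustrative sketch: two drifting clocks measure the equal one-second
# intervals of a standard. Clock A's tick frequency slowly increases,
# clock B's slowly decreases. The drift rate EPS is a hypothetical value.

EPS = 0.01  # assumed drift rate (fraction of nominal frequency per second)

def ticks_in_interval(t_start, drift):
    """Ticks accumulated over the standard interval [t_start, t_start + 1]
    by a clock whose instantaneous frequency is 1 + drift * t
    (integrated analytically)."""
    t_end = t_start + 1.0
    return (t_end - t_start) + drift * (t_end**2 - t_start**2) / 2.0

# Ten successive, physically equal standard intervals:
fast_clock = [ticks_in_interval(t, +EPS) for t in range(10)]
slow_clock = [ticks_in_interval(t, -EPS) for t in range(10)]

# The speeding clock reports the standard's intervals as growing, the
# slowing clock as shrinking -- a complete reversal of interval order.
assert fast_clock == sorted(fast_clock)
assert slow_clock == sorted(slow_clock, reverse=True)
```

Run against ten successive standard intervals, the speeding clock’s interval estimates are monotonically increasing while the slowing clock’s are monotonically decreasing, even though the intervals marked by the standard are physically equal.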
As with Bridgman, Ellis cannot defend his criterion by claiming that it applies only to
ordering relationships that have already been appropriately corrected for systematic errors.
For as we have seen, such corrections can only be made in light of the presupposition that
the relevant procedures measure the same quantity, and this is the very claim Ellis’ criterion
is supposed to establish.
Ellis may retort by claiming that his criterion is intended to provide only necessary, but
not sufficient, conditions for quantity individuation. This would be of some consolation if
the condition specified by Ellis’ criterion – namely, the convergence of linear order – were
commonly fulfilled whenever scientists compare measuring instruments to each other. But
almost all comparisons among measuring instruments in the physical sciences are expressed
in terms of intervals or ratios of outcomes, and we saw that Ellis’ criterion is not generally
fulfilled for intervals and ratios. Moreover, virtually all known laws of physics are expressed
in terms of quantity intervals and ratios. The discovery and confirmation of nomic relations,
which are among the primary aims of physics, require individuation criteria that are
applicable to intervals and ratios of quantities, but these are not covered by Ellis’ criterion.
2.3.3. Representational Theory of Measurement
Perhaps the best known contemporary philosophical account of measurement is the
Representational Theory of Measurement (RTM)48. Unlike the two previous accounts, RTM
does not explicitly discuss the individuation of quantities. Nevertheless, RTM discusses at
length the individuation of types of measurement scales. A scale type is individuated by the
transformations it can undergo. For example, the Celsius and Fahrenheit scales belong to the
same type (‘interval’ scales) because they have the same set of permissible transformations,
i.e. linear transformations with an arbitrary zero point49. The set of permissible
transformations for a given scale is established by proving a ‘uniqueness theorem’ for that
scale, a proof that rests on axioms concerning empirical relations among the objects
measured on that scale50.
RTM can be used to generate an objection to my analysis of systematic error.
According to this objection, the discrepancies I call ‘genuine systematic errors’ are simply
cases where the same quantity is measured on different scales. For example, the
discrepancies between mercury and alcohol thermometers arise because these instruments
represent temperature on different scales (one may call them the ‘mercury temperature scale’
and ‘alcohol temperature scale’.) RTM shows that these scales belong to the same type –
namely, interval scales. Moreover, RTM supposedly provides us with a conversion factor
that transforms temperature estimates from one scale to the other, and this conversion
eliminates the discrepancies. ‘Genuine systematic errors’, according to this objection, are not
48 Krantz et al. (1971).
49 Ibid, 10.
50 Ibid.
errors at all but merely byproducts of a subtle scale difference. RTM eliminates these
byproducts before the underdetermination problem I mention has a chance to arise.
Like the proposals by Bridgman and Ellis, this objection is circular. It purports to
eliminate genuine systematic errors by appealing to differences in measurement scale, but
any test for identifying differences in measurement scale must already presuppose that
genuine systematic errors have been corrected.
This is best illustrated by considering a variant of the problem of quantity
individuation. As before, we may assume that scientists are faced with apparent
disagreement between the outcomes of different measurements. However, in this variant
scientists are entertaining four possible explanations instead of just three:
(H1) the instruments are not measuring the same quantity;
(H1S) measurement outcomes are represented on different scales;
(H2) systematic errors have been inappropriately evaluated; or
(H3) measurement uncertainty has been underestimated.
According to the objection, hypothesis (H1S) can be tested independently of the other
three hypotheses. In other words, facts about the appropriateness and uniqueness of a scale
employed in measurement can be tested independently of questions about what, and how
accurately, the instrument is measuring. This is yet another conceptual foundationalist claim,
i.e. the claim that the concept of measurement scale is fundamental and therefore has
universal criteria of existence and identity.
If taken literally, conceptual foundationalism about measurement scales leads to the
same absurd multiplication of quantities already encountered above. This is because genuine
systematic errors by definition cannot be transformed away through alterations of
measurement scale. In the thermometry case, for example, the nonlinear discrepancy
between mercury and alcohol thermometers cannot be eliminated by transformations of the
interval scale, as the latter only admits of linear transformations. One is forced to conclude
that the thermometers are measuring temperature on different types of scales – a ‘mercury
scale type’ and an ‘alcohol scale type’ – with no permissible transformation between them. But this
conclusion is inconsistent with RTM, according to which both scales are interval scales and
hence belong to the same type. How can temperature be measured on two different interval
scales without there being a permissible transformation between them? The only way to
avoid inconsistency is to admit that the so-called ‘thermometers’ are not measuring the same
quantity after all, but two different and nonlinearly related quantities. Hence strict
conceptual foundationalism about measurement scales leads to the same sort of
fragmentation of quantity concepts already familiar from Bridgman and Ellis’ accounts. If
RTM is interpreted along such strict empiricist lines, it can provide very little insight into the
way measurement scales are employed in successful cases of scientific practice.
A second and supposedly more charitable option is to interpret RTM as applying to
indications in the idealized sense, already taking into account error corrections. This is
compatible with the views expressed by authors of RTM, who state that their notion of
‘empirical structure’ should be understood as an idealized model of the data that already
abstracts away from biases51. But on this reading the objection becomes circular. RTM’s
proofs of uniqueness theorems, which according to the objection are supposed to make
51 See Luce and Suppes (2002, 2).
corrections to genuine systematic errors redundant, presuppose that these corrections have
already been applied.
Not only the objection, but RTM itself becomes circular under this reading. RTM,
recall, aims to provide necessary and sufficient conditions for the appropriateness and
uniqueness of measurement scales. According to the so-called ‘charitable’ reading just
discussed, these conditions are specified under the assumption that measurement errors have
already been corrected. In other words, any test of (H1S) can only be performed under the
assumption that (H2) and (H3) have already been rejected. But any test of (H2) or (H3) must
already represent measurement outcomes on some measurement scale, for otherwise
quantitative error correction and uncertainty evaluation are impossible. In other words, the
representational appropriateness of a scale type must already be presupposed in the process of
obtaining idealized empirical relations among measured objects.52 Consequently these
empirical relations cannot be used to test the representational appropriateness of the scale
type being used. Instead, (H1S) is epistemically entangled with (H2) and (H3) and ipso facto
with (H1). The project of establishing the appropriateness and uniqueness of measurement
scales based on nothing but observable evidence is caught in a vicious circle.
The so-called ‘charitable’ reading of RTM fails to be charitable enough because it
takes RTM to be an epistemological theory of measurement. Those who read RTM in this light
expect it to provide insight into the way claims to the appropriateness and uniqueness of
measurement scales may be tested by empirical evidence. The authors of RTM occasionally
52 This last point has also been noted by Mari (2000), who claims that “the [correct] characterization of measurement is intensional, being based on the knowledge available about the measurand before the accomplishment of the evaluation. Such a knowledge is independent of the availability of any extensional information on the relations in [the empirical relational structure] RE” (ibid, 74-5, emphases in the original).
make comments that encourage this expectation from their theory53. But this expectation is
unfounded. As we have just seen, the justification for one’s choice of measurement scale
cannot be abstracted away from considerations relating to the acquisition and correction of
empirical data. Any test for the appropriateness of scales that does not take into account
considerations of this sort is bound to be circular or otherwise multiply quantities
unnecessarily. Given that RTM remains silent on considerations relating to the acquisition
and processing of empirical evidence, it cannot be reasonably expected to function as an
epistemological theory of measurement.
Under a third, truly charitable reading, RTM is merely meant to elucidate the
mathematical presuppositions underlying measurement scales. It is not concerned with
grounding empirical knowledge claims but with the axiomatization of a part of the
mathematical apparatus employed in measurement. Stripped from its epistemological guise,
RTM avoids the problem of quantity individuation. But the cost is substantial: RTM can no
longer be considered a theory of measurement proper, for measurement is a knowledge-
producing activity, and RTM does not elucidate the structure of inferences involved in
making knowledge claims on the basis of measurement operations. In other words, RTM
explicates the presuppositions involved in choosing a measurement scale but not the
empirical criteria for the adequacy of these presuppositions. RTM’s role with respect to
measurement theory is therefore akin to that of axiomatic probability theory with respect to
53 For example, the authors of RTM seem to suggest that empirical evidence justifies or confirms the axioms: “One demand is for the axioms to have a direct and easily understood meaning in terms of empirical operations, so simple that either they are evidently empirically true on intuitive grounds or it is evident how systematically to test them.” (Krantz et al 1971, 25)
quantum mechanics: both accounts supply rigorous analyses of indispensable concepts (scale,
probability) but not the conditions of their empirical application.
To summarize this section, the foundational empiricist attempt to specify a test of
quantity individuation (or scale type individuation) in terms of nothing more than relations
among observable indications of measuring instruments fails. And fail it must, because
indications themselves are insufficient to determine whether instruments measure different
(but correlated) quantities or the same quantity with some inaccuracy. The next section will
outline a novel epistemology of measurement, one that rejects foundationalism and dissolves
the problem of quantity individuation.
2.4. A model-based account of measurement
2.4.1. General outline
According to the account I will now propose, physical measurement is the coherent
and consistent attribution of values to a quantity in an idealized model of a physical process.
Such models embody theoretical assumptions concerning relevant processes as well as
statistical assumptions concerning the data generated by these processes. The physical
process itself includes all actual interactions among measured samples, instrument, operators
and environment, but the models used to represent such processes neglect or simplify many
of these interactions. It is only in light of some idealized model of the measuring process
that measurement outcomes can be assessed for accuracy and meaningfully compared to
each other. Indeed, it is only against the background of such simplified and approximate
representation of the measuring process that measurement outcomes can even be considered
candidates for objective knowledge.
To appreciate this last point in full, it is useful to distinguish between the indications (or
‘readings’) of an instrument and the outcomes of measurements performed with that
instrument. This distinction has already been implicit in the discussion above, but the model-
based view makes it explicit. Examples of indications are the height of a mercury column in
a barometer, the position of a pointer relative to the dial of an ammeter, and the number of
cycles (‘ticks’) generated by a clock during a given sampling period. More generally, an
indication is a property of an instrument in its final state after the measuring process has been
completed. The indications of instruments do not constitute measurement outcomes, and in
themselves are no different than the final states of any other physical process54. What gives
indications special epistemic significance is the fact that they are used for inferring values of
a quantity based on a model of the measurement process, a model that relates possible
indications to possible values of a quantity of interest. These inferred estimates of quantity
values are measurement outcomes. Examples are estimates of atmospheric pressure, electric
current and duration inferred from the abovementioned indications. Measurement outcomes
are expressed on a determinate scale and include associated uncertainties, although
sometimes only implicitly.
A hallmark of the model-based approach to measurement is that models are viewed as
preconditions for obtaining an objective ordering relation among measured objects. We already
saw, for example, that the ordering of time intervals or temperature intervals obtained by
54 Indications may be divided into ‘raw’ and ‘processed’, the latter being numerical representations of the former. Neither processed nor raw indications constitute measurement outcomes. For further discussion see Chapter 4, Section 4.2.2.
operating a clock or a thermometer depends on how scientists represent the relationship
between indications and values of the quantity being measured. Such ordering is a
consequence of modeling the instrument in a particular way and assigning systematic
corrections to its indications accordingly. Contrary to empiricist theories of measurement,
then, the ordering of objects with respect to the quantity being measured is never simply
given through observation but must be inferred based on a model of the measuring process.
Prior to such model-based inference, the ‘raw’ ordering of objects by the indications of an
empirical operation is nothing more than a local regularity that may just as plausibly be
ascribed to an idiosyncrasy of the instrument, the environment or the human operator as to
the objects being ordered.
This last claim is not meant as a denial of the existence of theory-free operations for
ordering objects, e.g. placing pairs of objects on the pans of an equal-arms balance.
However, such operations on their own do not yet measure anything, nor is measurement
simply a matter of mapping the results of such operations onto numbers. Measurement
claims, recall, are claims to objective knowledge – meaning that order is ascribed to measured
objects rather than to artifacts of the specific operation being used. Grounding such a claim to
objectivity involves differentiating operation-specific features from those that are due to a
pertinent difference among measured samples. As we already saw, different procedures that
supposedly measure the same quantity often produce inconsistent, and in some cases even
completely reversed, ‘raw’ orderings among objects. Such orderings must therefore be
considered operation-specific and cannot be taken as measurement outcomes.
To obtain a measurement outcome from an indication, a distinction must be drawn
between pertinent aspects of the measured objects and procedural artifacts. This involves
the development of what is sometimes called a ‘theory of the instrument’, or more exactly an
idealized model of the measurement process, from theoretical and statistical assumptions.
Such models allow scientists to account for the effects of local idiosyncrasies and correct the
outcomes accordingly. Unlike the ‘raw’ order indicated by an operation, the order resulting
from a model-based inference has the proper epistemic credentials to ground objective
claims to measurement, because it is based on coherent assumptions about the object (or
process, or event) being measured.
Not every estimation of a quantity value in an idealized model of a physical process is a
measurement. Rather, a measurement is based on a model that coheres with background
theoretical assumptions, and is consistent with other measurements of the same or related
quantities performed under different conditions. As a result, what counts as an instance of
measurement may change when assumptions about relevant quantities, instruments or
modeling practices are modified.
All measurement outcomes are relative to an abstract and idealized representation of
the procedure by which they were obtained. This explains how the outcomes of a
measurement procedure can change without any physical modification to that procedure,
merely by changing the way the instrument is represented. Similarly, the model-based
approach explains how the accuracy of a measuring instrument can be improved merely by
adding correction terms to the model representing the instrument. Thirdly, the model-
relativity of measurement outcomes explains how the same set of operations, again without
physical change, can be used to measure different quantities on different occasions
depending on the interests of researchers. An example is the use of the same pendulum to
measure either duration or gravitational potential without any physical change to the
pendulum or to the procedures of its operation and observation. The change is effected
merely by a modification to the mathematical manipulation of quantities in the model. For
measuring duration, researchers plug in known values for gravitational potential in their
model of the pendulum and use the indications of the pendulum (i.e. number of swings) to
tell the time, whereas measuring gravitational potential involves the opposite mathematical
procedure.
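For concreteness, the pendulum case can be sketched with the textbook small-oscillation period formula (my illustration; the thesis itself does not state the formula, and ‘gravitational potential’ is here represented by the local gravitational acceleration g):

```latex
% Idealized model: small-oscillation period of a pendulum of length L
T = 2\pi \sqrt{L/g}
% Measuring duration: treat g (and L) as known; n counted swings give
t = nT = 2\pi n \sqrt{L/g}
% Measuring gravity: treat t as known (e.g. from a reference clock); then
g = 4\pi^2 L \left(\frac{n}{t}\right)^2
```

The indications – a count of swings – are unchanged in both cases; only the choice of which parameter in the model is treated as unknown differs.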
The notions of accuracy and error are similarly elucidated in relation to models. The
accuracy of a measurement procedure is determined by the accuracy of model-based predictions
regarding that procedure’s outcomes. That is, a measurement procedure is accurate relative
to a given model if and only if the model accurately predicts the outcomes of that procedure
under a given set of circumstances55. Similarly, measurement error is evaluated as the
discrepancy between such model-based predictions and standard values of the quantity in
question. Such errors include, but are not limited to, discrepancies that can be estimated by
statistical analysis of repeated measurements. In attributing claims concerning accuracy and
error to predictions about instruments, rather than directly to instruments themselves, the
model-based account makes explicit the inferential nature of accuracy and error (see also
Hon 2009.)56
55 The accuracy of model-based predictions is evaluated by propagating uncertainties from ‘input’ quantities to ‘output’ quantities in the model, as will be clarified in Chapter 4.
56 Commenting on Hertz’ 1883 cathode ray experiments, Hon writes: “The error we discern in Hertz’ experiment cannot be associated with the physical process itself […]. Rather, errors indicate claims to knowledge. An error reflects the existence of an argument into which the physical process of the experiment is cast.” (2009, 21)
2.4.2. Conceptual quantity individuation
According to the model-based approach, a physical quantity is a parameter in a theory
of a kind of physical system. Specifically, a measurable physical quantity is a theoretical
parameter whose values can be related in a predictable manner to the final states of one or
more physical processes. A measurable quantity is therefore defined by a background theory
(or theories), which in turn inform the construction of models of particular processes
intended to measure that quantity.
The model-based approach is not committed to a particular metaphysical standpoint
on the reality of quantities. Whether or not quantities correspond to mind-independent
properties is seen as irrelevant to the epistemology of measurement, that is, to an analysis of
the evidential conditions under which measurement claims are justified. This is not meant to
deny that scientists often think of the quantities they measure as representing mind-
independent properties and that this way of thinking is fruitful for the development of
accurate measurement procedures. But whether or not the quantities scientists end up
measuring in fact correspond to mind-independent properties makes no difference to the
kinds of tests scientists perform or the inferences they draw from evidence, for scientists
have no access to such putative mind-independent properties other than through empirical
evidence57. As will become clear below, the model-based approach allows one to talk
coherently about accuracy, error and objectivity as properties of measurement claims without
57 My agnosticism with respect to the existence of mind-independent properties does not, of course, imply agnosticism with respect to the existence of objects of knowledge and the properties they possess qua objects of knowledge. A column of mercury has volume insofar as it can be reliably perceived to occupy space. I therefore accept a modest form of epistemic (e.g. Kantian) realism.
committing to any particular metaphysical standpoint concerning the truth conditions of
such claims. The model-based approach does, however, make a distinction among quantities
in terms of their epistemic status. The epistemic status of physical quantities varies from
merely putative to deeply entrenched depending on the demonstrated degree of success in
measuring them. As mentioned, to successfully measure a quantity is to estimate its values in
a consistent and coherent manner based on models of physical processes.
The model-based view provides a straightforward account of quantity individuation
that dissolves the underdetermination problem discussed above. In order to individuate
quantities across measuring procedures, one has to determine whether the outcomes of
different procedures can be consistently modeled in terms of the same parameter in the background
theory. If the answer is ‘yes’, then these procedures measure the same quantity relative to those
models.
A few clarifications are in order. First, by ‘consistently modeled’ I mean that outcomes
of different procedures converge within the uncertainties predicted by their respective
models. A detailed example of this sort of test was discussed in Chapter 1. Second, the
phrase ‘same parameter in the background theory’ requires clarification. A precondition for
even testing whether two instruments provide consistent outcomes is that the outcomes of
each instrument are represented in terms of the same theoretical parameter. By ‘same
theoretical parameter’ I mean a parameter that enters into approximately the same relations
with other theoretical parameters.58 The requirement to model outcomes in terms of the
same theoretical quantity therefore amounts to a weak requirement for nomic coherence among
58 This definition is recursive, but as long as the model has a finite number of parameters the recursion bottoms out. A more general definition is required for models with infinitely many parameters.
models specified in terms of that quantity, rather than to a strong requirement for identity of
extension or intension among quantity terms59.
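The consistency test described here — convergence of outcomes within model-predicted uncertainties — can be stated as a simple numerical criterion. The sketch below assumes independent uncertainties combined in quadrature and a coverage factor k = 2; these choices, like the function name, are illustrative assumptions rather than anything mandated by the text.

```python
import math

def compatible(outcome1, u1, outcome2, u2, k=2.0):
    """Do two measurement outcomes converge within their respective
    model-predicted uncertainties?  Criterion (assumed):
    |x1 - x2| <= k * sqrt(u1^2 + u2^2)."""
    return abs(outcome1 - outcome2) <= k * math.hypot(u1, u2)
```

On this criterion, two procedures count as measuring the same quantity, relative to their models, only when such checks succeed across circumstances.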
The emphasis on theoretical models may raise worries as to the status of pre-
theoretical measurements. After all, measurements were performed long before the rise of
modern physics. However, even when a full-fledged theory of the measured quantity is
missing or untrustworthy, some pre-theoretical background assumptions are still necessary
for comparing the outcomes of measurements. When in the 1840s Regnault made his
comparisons among thermometers he eschewed all assumptions concerning the nature of
caloric and the conservation of heat (Chang 2004, 77) but he still had to presuppose that
temperature is a single-valued quantity, that it increases when an object is exposed to a heat
source, and that an increase of temperature under constant pressure is usually correlated
with expansion. These background assumptions informed the way Regnault modeled his
instruments. Indeed, independently of these minimal assumptions the claim that Regnault’s
instruments measured the same quantity cannot be tested.
To summarize the individuation criterion offered by the model-based approach, two
procedures measure the same quantity only relative to some way of modeling those
procedures, and if and only if their outcomes are shown to be modeled consistently and
coherently in terms of the same theoretical parameter.
It is now time to clarify how this criterion deals with the problem of individuation
outlined earlier in this chapter. As mentioned, the problem of quantity individuation has two
distinct consequences that raise different sorts of challenges: one conceptual and the other
practical. On the conceptual level it is only a problem for foundational accounts of
59 For a recent proposal to individuate quantity concepts in this way see Diez (2002, 25-9).
measurement, namely those that attempt to specify theory-free individuation criteria for
measurable quantities. The model-based approach dissolves the conceptual problem by
resisting the temptation to offer foundational criteria of quantity individuation. The identity
of quantities across measurement procedures is relative to background assumptions, either
theoretical or pre-theoretical, concerning what those procedures are meant to measure.
Genuinely discrepant thermometers, for example, measure the same quantity only relative to
a theory of temperature, or before such theory is available, relative to pre-theoretical beliefs
about temperature. Similarly, different cesium atomic clocks measure the same quantity only
relative to some theory of time such as Newtonian mechanics or general relativity, or
otherwise relative to some pre-theoretical conception of time.
Even relative to a given theory metrologists sometimes have a choice as to whether or
not they represent instruments as measuring the same quantity. Relative to general relativity,
for example, atomic clocks placed at different heights above sea level measure the same
coordinate time but different proper times. Quantity individuation therefore depends on which
of these two quantities the clocks are modeled as measuring. The choice among different
ways of modeling a given instrument involves a difference in the systematic correction
applied to its indications. The latter point has already been illustrated in the case of the
Michelson-Morley apparatus, but it holds even in more mundane cases that do not involve
theory change. To return to the clock example, cesium fountain clocks that are represented
as measuring proper time do not require correction for gravitational red-shifts. The
discrepancy among their results is attributed to the fact that they occupy different reference
frames and therefore measure different proper times relative to those frames. On the other
hand, when the same clocks are represented as measuring the same coordinate time on the
geoid (an imaginary surface of equal gravitational potential that roughly corresponds to the
earth’s sea level) their indications need to be corrected for a gravitational red-shift.
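The gravitational red-shift correction mentioned here has a simple weak-field form. The sketch below is a minimal illustration, assuming the standard approximation Δf/f ≈ g·h/c² for a clock at small height h above the geoid; the function name and the example height are mine.

```python
G = 9.80665        # m/s^2, nominal gravitational acceleration at the geoid
C = 299_792_458.0  # m/s, speed of light

def redshift_correction(height_above_geoid_m):
    """Fractional frequency correction applied to a clock's indications when
    it is re-modeled as realizing coordinate time on the geoid.
    Weak-field, small-height approximation: delta_f / f ~= g * h / c^2."""
    return G * height_above_geoid_m / C**2

# A clock 1000 m above the geoid runs fast by roughly 1.1e-13 in fractional
# frequency relative to a clock on the geoid.
shift = redshift_correction(1000.0)
```

Whether this correction is applied at all depends, as the text notes, on whether the clock is modeled as measuring proper time or coordinate time.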
As already noted, the distribution of systematic errors among measurement procedures
and the individuation of quantities measured by those procedures are but two sides of the
same epistemic coin. Which side of the coin the scientific community will focus on when
resolving the next discrepancy depends on the particular history of its theoretical and
practical development. This conclusion stands in direct opposition to foundational
approaches, which attempt to provide sufficient conditions for establishing the identity of
measurable quantities independently of any particular scientific theory or experiment. The
model-based approach, by contrast, treats criteria for the individuation of quantities as
already embedded in some theoretical and material setting from the start. Claims concerning
the individuation of quantities are underdetermined by the evidence only in principle, when
such claims are viewed ‘from nowhere.’ But to view such claims independently of their
particular theoretical and material context is to misunderstand how measurement produces
knowledge. Measurement outcomes are the results of model-based inferences, and owe their
objective validity to the idealizing assumptions that ground such inferences. In the absence
of such idealizations, there is no principled way of telling whether discrepancies should be
attributed to the objects being measured or to extrinsic factors.
The search for theory-free criteria of quantity individuation is therefore opposed to the
very supposition that measurement provides objective knowledge. Such foundational
pursuits sprout from a conflation between instrument indications, which constitute the
empirical evidence for making measurement claims, and measurement outcomes, which are
value estimates that constitute the content of these claims. Once the conflation is pointed out,
it becomes clear that the background assumptions involved in inferring outcomes from
indications play a necessary and legitimate role in grounding claims about quantity
individuation, whereas the ‘raw’ evidence alone cannot and should not be expected to do so.
2.4.3. Practical quantity individuation
In addition to dissolving the conceptual problem of quantity individuation, the model-
based approach to measurement also sheds light on possible solutions to the practical
problem of quantity individuation, a task that is beyond the purview of other philosophical
theories of measurement. The practical problem, recall, is that of selecting which of the three
hypotheses (H1) – (H3) above to accept when faced with genuinely discrepant measurement
outcomes. Laboratory scientists are habitually confronted with this sort of challenge,
especially if they work in the forefront of accurate measurement where existing standards
cannot settle the issue.
A common solution to the practical problem of quantity individuation is to accept only
(H3), the hypothesis that measurement uncertainty has been underestimated, and enhance
the stated uncertainties so as to achieve compatibility among results. This is equivalent to
extending uncertainty bounds (sometimes mistakenly called ‘error bars’) associated with
different outcomes until the outcomes are statistically compatible. It is common to use
formal measures of statistical compatibility such as the Birge ratio60 to assess the success of
adjustments to stated uncertainties. Agreement is restored either by re-evaluating type-B
uncertainties associated with measuring procedures, by modifying statistical models of noise,
60 See Birge (1932). Henrion & Fischhoff (1986, 792) provide a concise introduction to the Birge ratio.
or by increasing the stated uncertainty ad hoc based on ‘educated guesses’ as to which
procedures are less accurate. Regardless of the technique of adjustment, the disadvantage of
accepting only (H3) is that the increase in uncertainty required to recover agreement is
similar in magnitude to the discrepancy among the outcomes. If the discrepancy is large
relative to the initially stated uncertainty, this strategy results in a large increase of stated
uncertainties.
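The Birge ratio test and the (H3) strategy of inflating stated uncertainties can be sketched as follows. This is a schematic illustration under standard assumptions (weighted mean, reduced chi-squared); the function names are mine, and real adjustments, as the text notes, may instead target specific type-B components.

```python
import math

def birge_ratio(values, uncertainties):
    """Birge ratio: square root of the reduced chi-squared of the outcomes
    about their weighted mean.  R_B ~ 1 indicates statistical compatibility;
    R_B >> 1 suggests understated uncertainties."""
    weights = [1.0 / u**2 for u in uncertainties]
    mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    chi2 = sum(w * (x - mean) ** 2 for w, x in zip(weights, values))
    return math.sqrt(chi2 / (len(values) - 1))

def inflate(uncertainties, r_b):
    """Accepting only (H3): scale every stated uncertainty by R_B (if > 1)
    so that the adjusted outcome set becomes statistically compatible."""
    factor = max(r_b, 1.0)
    return [u * factor for u in uncertainties]
```

Note the disadvantage flagged in the text: the inflation factor is of the same order as the discrepancy, so a large discrepancy forces a large loss of stated precision.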
Another option is to accept only (H2), the hypothesis that a systematic bias influences
the outcomes of some of the measurements. In some cases such bias may be corrected by
physically controlling its source, e.g. by shielding against background effects. This strategy is
nevertheless limited by the fact that not all sources of systematic bias are controllable (such
as the presence of nonzero gravitational potential on earth) and that others can only be
controlled to a limited extent. Moreover, for older measurements the apparatus may no
longer be available and attempts to recreate the apparatus may not succeed in reproducing its
idiosyncrasies. For these reasons, systematic biases are often corrected only numerically, i.e.
by modifying the theoretical model of the instrument with a correction factor that reflects
the best estimate of the magnitude of the bias. Because accuracy is ascribable to
measurement outcomes rather than to instrument indications, a model-based correction that
modifies the outcome is a perfectly legitimate tool for enhancing accuracy, even if it has no
effect on the indications of the instrument.
A third strategy for handling the practical challenge of quantity individuation, one that
the model-based approach is especially useful in elucidating, involves accepting all three
hypotheses (H1), (H2) and (H3) – namely, accepting that the instruments (as initially
modeled) measure different quantities, that a systematic error is present that has not been
appropriately corrected and that measurement uncertainties have been underestimated.
Agreement is then restored by a method I call ‘unity through idealization’, a method that is
central to the work of metrologists because it restores agreement with a relatively small loss
of accuracy and without necessarily involving physical interventions.
The core idea behind this method is known as Galilean idealization (McMullin 1985).
Galileo’s famous measurements of free-fall acceleration were performed on objects rolling
down inclined planes. This replacement of the experimental object was made possible by an
idealization: acceleration on an inclined plane is an imperfect version of free-fall acceleration
in a vacuum. To measure free-fall acceleration, one does not have to experiment on a free-
falling object in a vacuum but merely to conceptually remove the effects of impediments
such as the plane, air resistance etc. from an abstract representation of the rolling object.
More generally, the principle of unity through idealization is this: the same quantity can be
measured in different concrete circumstances so long as these circumstances are represented
as approximations of the same ideal circumstances.
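Galileo's inference can be put in numerical form. The sketch below uses the simplest idealized model, a = g·sin(θ) for a frictionless, non-rotating body; a real rolling ball would also require its rotational inertia to be conceptually removed. The function name and sample values are illustrative.

```python
import math

def free_fall_g_from_incline(measured_accel, angle_deg):
    """'Unity through idealization': infer free-fall acceleration from an
    inclined-plane measurement by conceptually removing the plane.
    Idealized model (assumed): a = g * sin(theta), i.e. a frictionless,
    non-rotating body on the incline."""
    return measured_accel / math.sin(math.radians(angle_deg))
```

The measured accelerations on planes of different inclines differ, yet all are modeled as approximations of one and the same ideal quantity, g.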
This principle is utilized to restore agreement among seemingly divergent
measurement outcomes. For example, when cesium fountain clocks are found to
systematically disagree in their outcomes, it is occasionally possible to resolve the
discrepancy by further idealizing the theoretical models representing these clocks. The
discrepancy is attributed to the fact that the clocks were not measuring the same quantity in
the less idealized representation. For example, the clocks may be found to have been
measuring different frequencies of cesium, the difference being caused by the
presence of different levels of background thermal radiation. Instead of physically equalizing
the levels of background radiation across clocks, the clocks are conceptually re-modeled so
as to measure the ideal cesium frequency in the absence of thermal background, i.e. at a
temperature of absolute zero61. Under this new and further idealized representation of the
clocks, metrologists are justified in applying a correction factor to the model of each clock
that reflects their best estimate of the effect of thermal radiation on the indications of that
clock. This correction involves a type-B uncertainty that is added to the total uncertainty of
each clock, but this new uncertainty is typically much smaller than the discrepancy being
corrected. When successful, this strategy leads to the elimination of discrepancies with only a
small loss of accuracy, and with no physical modification to the apparatus.
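The thermal-radiation correction described here can be sketched numerically. The (T/300 K)⁴ scaling of the blackbody shift is standard, but the coefficient below is an order-of-magnitude illustration rather than an authoritative value, and the function names are mine.

```python
import math

def blackbody_correction(ambient_temp_k, coeff=-1.7e-14):
    """Fractional frequency correction that re-models a cesium clock as
    measuring the ideal transition frequency at absolute zero.
    The coefficient is illustrative of the published order of magnitude."""
    shift = coeff * (ambient_temp_k / 300.0) ** 4  # modeled blackbody shift
    return -shift  # the correction removes the modeled shift

def combined_uncertainty(u_statistical, u_type_b):
    """Add the correction's type-B uncertainty in quadrature, as described
    in the text."""
    return math.hypot(u_statistical, u_type_b)
```

The added type-B term is typically orders of magnitude smaller than the discrepancy it resolves, which is why this strategy restores agreement at only a small cost in accuracy.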
2.5. Conclusion: error as a conceptual tool
Philosophers of science have traditionally sought to analyze what they took to be basic
concepts of measurement independently of any particular scientific theory, experimental
tradition or instrument. This approach has proved fruitful for the axiomatization of
measurement scales, but as an approach to the epistemology of measurement, i.e. to the study
of the conditions under which measurement claims are justified in light of possible evidence,
conceptual foundationalism encounters severe limitations. This chapter was dedicated to the
discussion of one such limitation of conceptual foundationalism, namely to its attempt to
answer questions of the form ‘what does procedure P measure?’ independently of questions
of the form ‘how accurate is P?’ As I have shown, the two sorts of questions are
epistemically entangled, such that no empirical test can be devised that would answer one
61 For a detailed discussion of the modeling of cesium fountain clocks see Chapter 1.
without at the same time answering the other. Moreover, the choice of answers to both
questions depends on background theories and on traditions of interpreting evidence that
are accepted by the scientific community. Independently of such theories and traditions the
indications of measuring instruments are devoid of epistemic significance, i.e. cannot be
used to ground claims about the objects being measured.
The model-based approach offered here acknowledges the context-dependence of
measurement claims and dissolves the worries of underdetermination associated with
conceptual foundationalism. More importantly, the model-based approach clarifies how the
use of idealizations allows scientists to ground claims to the unity of quantity concepts. The
unity of quantity concepts across different measurement procedures rests on scientists’
success in consistently and coherently modeling these procedures in terms of the same
theoretical parameter. This treatment of quantity individuation clarifies several aspects of
physical measurement that have hitherto been neglected or poorly understood by
philosophers of science, most notably the notion of systematic error. Far from merely being
a technical concern for laboratory scientists, the possibility of systematic error is a central
conceptual tool in coordinating theory and experiment. Genuine systematic errors constitute
the conceptual ‘glue’ that allows scientists to model different instruments in terms of a single
quantity despite nonscalable discrepancies among their indications. The applicability of
quantity concepts across different domains, and hence the generality of physical theory, owe
their existence to the possibility of distributing systematic errors among the indications of
measuring instruments.
3. Making Time: A Study in the Epistemology of Standardization
Abstract: Contemporary timekeeping is an extremely successful standardization project, with most national time signals agreeing well within a microsecond. But a close look at methods of clock synchronization reveals a patchwork of ad hoc corrections, arbitrary rules and seemingly circular inferences. This chapter offers an account of standardization that makes sense of the stabilizing role of such mechanisms. According to the model-based account proposed here, to standardize a quantity is to legislate the proper mode of application of a quantity-concept to a collection of exemplary artifacts. This legislation is performed by specifying a hierarchy of models of these artifacts at different levels of abstraction. I show that this account overcomes limitations associated with conventionalist and constructivist explanations for the stability of networks of standards.
3.1. Introduction
The reproducibility of quantitative results in the physical sciences depends on the
availability of stable measurement standards. The maintenance, dissemination and
improvement of standards are central tasks in metrology, the science of reliable measurement.
With the guidance of the International Bureau of Weights and Measures (Bureau International
des Poids et Mesures or BIPM) near Paris, a network of metrological institutions around the
globe is responsible for the ongoing comparison and adjustment of standards.
Among the various standardization projects in which metrologists are engaged,
contemporary timekeeping is arguably the most successful, with the vast majority of national
time signals agreeing well within a microsecond and remaining stable to within a few nanoseconds a
month62. The standard measure of time currently used in almost every context of civil and
scientific life is known as Coordinated Universal Time or UTC63. UTC is the product of an
international cooperative effort by time centers that themselves rely on state-of-the-art
atomic clocks spread throughout the globe. These clocks are designed to measure the
frequencies associated with specific atomic transitions, including the cesium transition,
which has defined the second since 1967.
What accounts for the overwhelming stability of contemporary timekeeping standards?
Or, to phrase the question somewhat differently, what factors enable a variety of
standardization laboratories around the world to so closely reproduce Coordinated Universal
Time? The various explanans one could offer in response to this question may be divided
into two broad kinds. First, one could appeal to the natural stability, or regularity, of the
atomic clocks that contribute to world time. Second, one could appeal to the practices by
which metrological institutions synchronize these atomic clocks. The adequate combination
of these two sorts of explanans and the limits of their respective contribution to stability are
contested issues among philosophers and sociologists of science. This chapter will discuss
three accounts of standardization along with the explanations they offer for the stability of
62 Barring time zone and daylight saving adjustments. See BIPM (2011) for a sample comparison of national approximations to UTC.
63 UTC replaced Greenwich Mean Time as the global timekeeping reference in 1972. The acronym ‘UTC’ was chosen as a compromise to avoid favoring the order of initials in either English (CUT) or French (TUC).
UTC. Each account will assign different explanatory roles to the social and natural factors
involved in stabilizing timekeeping standards.
The first kind of explanation is inspired by conventionalism as expounded by Poincaré
([1898] 1958), Reichenbach ([1927] 1958) and Carnap ([1966] 1995). According to
conventionalists, metrologists are free to choose which natural processes they use to define
uniformity, namely, to define criteria of equality among time intervals. Prior to this choice,
which is in principle arbitrary, there is no fact of the matter as to which of two given clocks
‘ticks’ more uniformly. The choice of natural process (e.g. solar day, pendulum cycle, or
atomic transition) depends on considerations of convenience and simplicity in the
description of empirical data. Once a ‘coordinative definition’ of uniformity is given, the
truth or falsity of empirical claims to uniformity is completely fixed: how uniformly a given
clock ‘ticks’ relative to currently defined criteria is a matter of empirical fact. In Carnap’s
own words:
If we find that a certain number of periods of process P always match a certain number of periods of process P’, we say that the two periodicities are equivalent. It is a fact of nature that there is a very large class of periodic processes that are equivalent to each other in this sense. (Carnap [1966] 1995, 82-3, my emphasis)

We find that if we choose the pendulum as our basis of time, the resulting system of physical laws will be enormously simpler than if we choose my pulse beat. […] Once we make the choice, we can say that the process we have chosen is periodic in the strong sense. This is, of course, merely a matter of definition. But now the other processes that are equivalent to it are strongly periodic in a way that is not trivial, not merely a matter of definition. We make empirical tests and find by observation that they are strongly periodic in the sense that they exhibit great uniformity in their time intervals. (ibid, 84-5, my emphases)

Of course, some uncertainty is always involved in determining facts about uniformity
experimentally. But for a conventionalist this uncertainty arises solely from the limited
precision of measurement procedures and not from a lack of specificity in the definition.
Accordingly, the stability of contemporary timekeeping is explained by a combination of two
factors: on the social side, the worldwide agreement to define uniformity on the basis of the
frequency of the cesium transition; and on the natural side, the fact that all cesium atoms
under specified conditions have the same frequency associated with that particular transition.
The universality of the cesium transition frequency is, according to conventionalists, a mind-
independent empirical regularity that metrologists cannot influence but may only describe
more or less simply.
The second, constructivist sort of explanation affords standardization institutions
greater agency in the process of stabilization. Standardizing time is not simply a matter of
choosing which pre-existing natural regularity to exploit; rather, it is a matter of constructing
regularities from otherwise irregular instruments and human practices. Bruno Latour and
Simon Schaffer have expressed this position in the following ways:
Time is not universal; every day it is made slightly more so by the extension of an international network that ties together, through visible and tangible linkages, each of all the reference clocks of the world and then organizes secondary and tertiary chains of references all the way to this rather imprecise watch I have on my wrist. There is a continuous trail of readings, checklists, paper forms, telephone lines, that tie all the clocks together. As soon as you leave this trail, you start to be uncertain about what time it is, and the only way to regain certainty is to get in touch again with the metrological chains. (Latour 1987, 251, emphasis in the original)

Recent studies of the laboratory workplace have indicated that institutions’ local cultures are crucial for the emergence of facts, and instruments, from fragile experiments. […] But if facts depend so much on these local features, how do they work elsewhere? Practices must be distributed beyond the laboratory locale and the context of knowledge multiplied. Thus networks are constructed to distribute instruments and values which make the world fit for science. Metrology, the establishment of standard units for natural quantities, is the principal enterprise which allows the domination of this world. (Schaffer 1992, 23)
According to Latour and Schaffer, the metrological enterprise makes a part of the
noisy and irregular world outside of the laboratory “fit for science” by forcing it to replicate
an order otherwise exhibited only under controlled laboratory conditions. Metrologists
achieve this aim by extending networks of instruments throughout the globe along with
protocols for interpreting, adjusting and comparing these instruments. The fact, then, that
metrologists succeed in stabilizing their networks should not be taken as evidence for pre-
existing regularities in the operation of instruments. On the contrary, the stability of
metrological networks explains why scientists discover regularities outside the laboratory:
these regularities have already been incorporated into their measuring instruments in the
process of their standardization.
This chapter will argue that both conventionalist and constructivist accounts of
standardization offer only partial and unsatisfactory explanations for the stability of
networks of standards. These accounts focus too narrowly on either natural or social
explanans, but any comprehensive picture of stabilization must incorporate both. I will
propose a third, ‘model-based’ alternative to the conventionalist and constructivist views of
standardization, which combines the strengths of the first two accounts and explains how
both natural and social elements are mobilized through metrological practice.
This third approach views standardization as an ongoing activity aimed at legislating
the proper mode of application of a theoretical concept to certain exemplary artifacts. By
‘legislation’ I mean the specification of rules for deciding which concrete particulars fall
under a concept. In the case of timekeeping, metrologists legislate the proper mode of
application of the concept of uniformity of time to an ensemble of atomic clocks. That is,
metrologists specify algorithms for deciding which of the clocks in the ensemble
approximate the theoretical ideal of uniformity more closely. Contrary to the views of
conventionalists, this legislation is not a matter of arbitrary, one-time stipulation. Instead, I
will argue that legislation is an ongoing, empirically-informed activity. This activity is
required because theoretical definitions by themselves do not completely determine how the
defined concept is to be applied to particulars. Moreover, I will show that such acts of
legislation are partly constitutive of the regularities metrologists discover in the behavior of
their instruments. Which clocks count as ‘ticking’ more uniformly relative to each other
depends – though only partially – on how metrologists legislate the mode of application of
the concept of uniformity.
A crucial part of legislation is the construction of idealized models of measuring
instruments. As I will argue, legislation proceeds by constructing a hierarchy of idealized
models that mediate between the theoretical definition of the concept and concrete artifacts.
These models are iteratively modified in light of empirical data so as to maximize the
regularity with which concrete instruments are represented under the theoretical concept.
Additionally, instruments themselves are modified in light of the most recent models so as
to maximize regularity further. In this reciprocal exchange between abstract and concrete
modifications, regular behavior is iteratively imposed on the network ‘from above’ and
discovered ‘from below’, leaving genuine room for both natural and social explanans in an
account of stabilization. Acts of legislation are therefore conceived both as constitutive of
the regularities exhibited by instruments and as preconditions for the empirical discovery of
new regularities (or irregularities) in the behaviors of those instruments64.
The first of this chapter’s three sections presents the central methods and challenges
involved in contemporary timekeeping. The second section discusses the strengths and
64 In this respect the model-based account continues the analysis of measurement offered by Kuhn ([1961] 1977). Kuhn took scientific theories to be both constitutive of the correct application of measurement procedures and as preconditions for the discovery of anomalies. The model-based account extends Kuhn’s insights to the maintenance of metrological standards, where local models play a role analogous to theories in Kuhn’s account.
limits of conventionalist and constructivist explanations for the stability of metrological
networks, while the third and final section develops the model-based account of
standardization and demonstrates why it provides a more complete and satisfactory
explanation for the stability of UTC than the first two.
3.2. Making time universal
3.2.1. Stability and accuracy
The measurement of time relies predominantly on counting the periods of cyclical
processes, namely clocks. Until the late 1960s, time was standardized by recurrent
astronomical phenomena such as the apparent solar noon, and artificial clocks served only as
secondary standards. Contemporary time standardization relies on atomic clocks, i.e.
instruments that produce an electromagnetic signal that tracks the frequency of a particular
atomic resonance. The two central desiderata for a reliable clock are known in the
metrological jargon as frequency stability and frequency accuracy. The frequency of a clock
is said to be stable if it ticks at a uniform rate, that is, if its cycles mark equal time intervals.
The frequency of a clock is said to be accurate if it ticks at the desired rate, e.g. one cycle per
second.
Frequency stability is, in principle, sufficient for reproducible timekeeping. A collection
of clocks with perfectly stable frequencies would tick at constant rates relative to each other,
and so the readings of any such clock would be sufficient to reproduce the readings of any
of the others by simple linear conversion65. A collection of frequency-stable clocks is
therefore also ‘stable’ in the broader sense of the term, i.e. supports the reproducibility of
measurement outcomes. For this reason I will use the term ‘stability’ insofar as it pertains to
collections of clocks without distinguishing between its restricted (frequency-stability) and
broader (reproducibility) senses unless the context requires otherwise.
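The ‘simple linear conversion’ mentioned above can be made concrete with a short sketch. This is purely illustrative: the function name and the numbers are my own, and a real conversion would also need to handle the relativistic effects set aside in footnote 65.

```python
# Illustrative only: two ideally frequency-stable clocks differ by a
# constant rate ratio and a constant offset, so either clock's reading
# can be reproduced from the other's by a linear transformation.

def convert_reading(t_a, rate_ratio, offset):
    """Reproduce clock B's reading from clock A's reading t_a.

    rate_ratio: constant ratio of B's rate to A's rate (hypothetical)
    offset: B's reading at the moment A read zero (hypothetical)
    """
    return rate_ratio * t_a + offset

# Suppose clock B runs 1.00000002 times as fast as clock A and read
# 5.0 s when A was started; A now reads 1000.0 s.
reading_b = convert_reading(1000.0, 1.00000002, 5.0)  # about 1005.00002 s
```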
In practice, no clock has a perfectly stable frequency. The very notion of a stable
frequency is an idealized one, derived from the theoretical definition of the standard second.
Since 1967 the second has been defined as the duration of exactly 9,192,631,770 periods of
the radiation corresponding to a hyperfine transition of cesium-133 in the ground state66. As
far as the definition is concerned, the cesium atom in question is at rest at a temperature of
absolute zero, with no background fields influencing the energy associated with the transition.
Under these ideal conditions a cesium atom would constitute a perfectly stable clock. There
are several different ways to construct clocks that would approximate – or ‘realize’ – the
conditions specified by the definition. Different clock designs result in different trade-offs
between frequency accuracy, frequency stability and other desiderata, such as ease of
maintenance and ease of comparison.
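The definitional number lends itself to a quick worked calculation. The figure below is simple arithmetic from the 9,192,631,770-period definition quoted above, not an additional metrological datum:

```python
# Arithmetic implied by the 1967 definition: one second is exactly
# 9,192,631,770 periods of the cesium-133 hyperfine transition
# radiation, so each period lasts the reciprocal of that count.
F_CS = 9_192_631_770          # defined number of periods per second

period = 1 / F_CS             # duration of one cycle, in seconds
print(f"{period:.4e}")        # prints 1.0878e-10 (about 0.109 ns)
```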
Primary realizations of the second are designed for optimal accuracy, i.e. minimal
uncertainty with respect to the rate at which they ‘tick’. As of 2009, thirteen primary
realizations are maintained by leading national metrological laboratories worldwide67. These
clocks are special by virtue of the fact that every known influence on their output frequency
65 Barring relativistic effects.
66 BIPM (2006), 113
67 As of 2009, active primary frequency standards were maintained by laboratories in France, Germany, Italy, Japan, the UK, and the US (BIPM 2010, 33)
is controlled and rigorously modelled, resulting in detailed ‘uncertainty budgets.’ The clock
design implemented in most primary standards is the ‘cesium fountain’, so called because it
‘tosses’ cesium atoms up in a vacuum which then fall down due to gravity. This design
allows for a higher signal-to-noise ratio and therefore decreases measurement uncertainty.
The complexity of cesium fountains, however, and the need to routinely monitor their
performance and environment prevents them from running continuously. Instead, each
cesium fountain clock operates for a few weeks at a time, about five times a year. The
intermittent operation of cesium fountain clocks means that they cannot be used directly for
timekeeping. Instead, they are used to calibrate secondary standards, i.e. atomic clocks that are
less accurate but run continuously for years. About 350 such secondary standards are
employed to keep world time68. These clocks are highly stable in the short run, meaning that
the ratios between the frequencies of their ‘ticks’ remain very nearly constant over weeks and
months. But over longer periods the frequencies of secondary standards exhibit drifts, both
relative to each other and to the frequencies of primary standards.
Because neither primary nor secondary standards ‘tick’ at exactly the same rate,
metrologists are faced with a variety of real durations that can all be said to fit the definition
of the second with some degree of uncertainty. Metrologists are therefore faced with the
task of realizing the second based on indications from multiple, and often divergent, clocks.
In tackling this challenge, metrologists cannot simply appeal to the definition of the second
to tell them which clocks are more accurate as it is too idealized to serve as the basis for an
evaluation of concrete instruments. In Chapter 1 I called this the problem of multiple
68 Panfilo and Arias (2009)
realizability of unit definitions and discussed the way this problem is solved in the case of
primary frequency standards.
This chapter focuses on the ways metrologists solve the problem of multiple
realizability in the context of international timekeeping, where the goal is not merely to
produce a good approximation of the second but also to maintain an ongoing measure of
time and synchronize clocks worldwide in accordance with this measure. Timekeeping is an
elaborate task that extends well beyond the evaluation of a handful of carefully maintained
primary standards. It encompasses the global transmission of time signals that enable
coordination in every aspect of civil and scientific life. From communication satellites, to
financial exchanges, to the dating of astronomical observations, Coordinated Universal Time
is meant to guarantee that all of our clocks tell the same time, and it must manage to do so
despite the fact that every clock that maintains UTC ‘ticks’ with a slightly different ‘second’.
From the point of view of relativity theory, UTC is an approximation of terrestrial time,
a theoretically defined coordinate time scale on the earth’s surface69. Ideally, one can imagine
all of the atomic clocks that participate in the production of UTC as located on a rotating
surface of equal gravitational potential that approximates the earth’s sea level. Such a surface is
called a ‘geoid’, and terrestrial time is the time a perfectly stable clock on that surface would
tell when viewed by a distant observer. However, much like the definition of the second, the
definition of terrestrial time is highly idealized and does not specify the desired properties of
any concrete clock ensemble. Here again, metrologists cannot determine how well UTC
69 More exactly, it is International Atomic Time (TAI), identical to UTC except for leap seconds, that constitutes a realization of Terrestrial Time.
approximates terrestrial time based merely on the latter’s definition, and must compare UTC
to other realizations of terrestrial time.
3.2.2. A plethora of clocks
Let us now turn to the method by which metrologists create a universal measure of
time. At the BIPM near Paris, indications from around 350 secondary standards held at over
sixty national laboratories are processed. The BIPM receives a reading from each clock every five
days and uses these indications to produce UTC. Coordinated Universal Time is a measure
of time whose scale interval is intended to remain as close as is practically possible to a
standard second. Yet UTC is not a clock; it does not actually ‘tick’, and cannot be
continuously read off the display of any instrument. Instead, UTC is an abstract measure of
time: a set of numbers calculated monthly in retrospect, based on the readings of
participating clocks70. These numbers indicate how late or early each nation’s ‘master time’,
its local approximation of UTC, has been running in the past month. Typically ranging from
a few nanoseconds to a few microseconds, these numbers allow national metrological
institutes to then tune their clocks to internationally accepted time. Table 3.1 is an excerpt
from the monthly publication issued by the BIPM in which deviations from UTC are
reported for each national laboratory.
70 There are many clocks that approximate UTC, of course. As will be mentioned below, the BIPM and national laboratories produce continuous time signals that are considered realizations of UTC. However, UTC itself is an abstract measure and should not be confused with its many realizations.
Table 3.1: Excerpt from Circular-T (BIPM 2011), a monthly report through which the International Bureau of Weights and Measures disseminates Coordinated Universal Time (UTC) to national standardization institutes. The numbers in the first seven columns indicate differences in nanoseconds between UTC and each of its local approximations. The last three columns indicate type-A, type-B and total uncertainties for each comparison. (Only data associated with the first twenty laboratories is shown.)
In calculating UTC, metrologists face multiple challenges. First, almost none of the
clocks contributing to UTC are primary standards. As previously mentioned, most
primary standards do not run continuously. Consequently, UTC is maintained by a free-
running ensemble of secondary standards – stable atomic clocks that run continuously for
years but undergo less rigorous uncertainty evaluations than primary standards. Today the
majority of these clocks are commercially manufactured by Hewlett-Packard or one of its
offshoot companies, Agilent and Symmetricom. These clocks have proven to be
exceptionally stable relative to each other, and the number of HP clocks that participate in
UTC has been steadily increasing since their introduction into world timekeeping in the early
1990s. As of 2010 HP clocks constitute over 70 percent of contributing clocks71.
Comparing clocks in different locations around the globe requires a reliable method of
fixing the interval of comparison. This is another major challenge to globalising time. Were
the clocks located in the same room, they could be connected by optical fibres to a counter
that would indicate the difference, in nanoseconds, among their readings every five days.
Over large distances, time signals are transmitted via satellite. In most cases Global
Positioning System (GPS) satellites are used, thereby ‘linking’ the readings of participating
clocks to GPS time. But satellite transmissions are subject to delays, which fluctuate
depending on atmospheric conditions. Moreover, GPS time is itself a relatively unstable
derivative of UTC. These factors introduce uncertainties to clock comparison data known as
time transfer noise. Transfer noise, which increases with a laboratory’s distance from Paris, is often much
71 Petit (2004, 208), BIPM (2010, 52-67). A smaller portion of continuously-running clocks are hydrogen masers, i.e. atomic clocks that probe a transition in hydrogen rather than in cesium.
larger than the local instabilities of contributing clocks. This means that the stability of UTC
is in effect limited by satellite transmission quality.
3.2.3. Bootstrapping reliability
The first step in calculating UTC involves processing data from hundreds of
continually operating atomic clocks and producing a free-running time scale, EAL (Échelle
Atomique Libre). EAL is an average of clock indications weighted by clock stability. Finding
out which clocks are more stable than others requires some higher standard of stability
against which clocks would be compared, but arriving at such a standard is the very goal of
the calculation. For this reason EAL itself is used as the standard of stability for the clocks
contributing to it. Every month, the BIPM updates the weight of each clock according to how
well its readings predicted the weighted average of the EAL clock ensemble over the past twelve months.
The updated weight is then used to average clock data in the next cycle of calculation. This
method promotes clocks that are stable relative to each other, while clocks whose stability
relative to the overall average falls below a fixed threshold are given a weight of zero, i.e.
removed from that month’s calculation. The average is then recalculated based on the
remaining clocks. The process of removing offending clocks and recalculating is repeated
exactly four times in each monthly cycle of calculation72.
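The reweighting loop just described can be sketched in a few lines. This is a toy reconstruction, not the BIPM’s actual code: the function name, the simple deviation threshold, and the sample numbers are my assumptions, and the real weights are derived from twelve months of prediction performance rather than a single comparison.

```python
# Toy sketch (my reconstruction, not BIPM code): clocks whose readings
# deviate from the weighted average by more than a threshold receive a
# weight of zero, and the average is recalculated; the removal step is
# repeated four times in each monthly cycle, as described in the text.

def weighted_scale(readings, weights, threshold, rounds=4):
    """readings, weights: dicts keyed by clock id.

    Assumes at least one clock always survives the threshold test."""
    w = dict(weights)
    avg = None
    for _ in range(rounds):
        total = sum(w.values())
        avg = sum(w[c] * r for c, r in readings.items()) / total
        for c, r in readings.items():
            if abs(r - avg) > threshold:
                w[c] = 0.0  # drop the 'offending' clock for this month
    return avg, w

# Hypothetical readings in arbitrary units; clock 'c' is an outlier.
avg, w = weighted_scale({'a': 0.0, 'b': 0.2, 'c': 9.0},
                        {'a': 1.0, 'b': 1.0, 'c': 1.0}, threshold=4.0)
```

Note that once a clock is zero-weighted it stays excluded for the remainder of the monthly calculation, which is why the loop only ever lowers weights.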
Though effective in weeding out ‘noisy’ clocks, the weight updating algorithm
introduces new perils to the stability of world time. First, there is the danger of a positive
72 Audoin and Guinot 2001, 249.
feedback effect, i.e. a case in which a few clocks become increasingly influential in the
calculation simply because they have been dominant in the past. In this scenario, EAL would
become tied to the idiosyncrasies of a handful of clocks, thereby increasing the likelihood
that the remaining clocks would drift further away from EAL. For this reason, the BIPM
limits the weight allowed to any clock to a maximum of about 0.7 percent73. The method of
fixing this maximum weight is itself occasionally modified to optimize stability.
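The cap itself is simple arithmetic, and it reconciles the two figures given here: the text’s ‘about 0.7 percent’ and footnote 73’s 2.5/N rule agree once one plugs in roughly 350 contributing clocks.

```python
# Arithmetic check: footnote 73 caps each clock's weight at 2.5/N.
# With roughly 350 contributing clocks this comes to about 0.7
# percent, the figure quoted in the text.
n_clocks = 350
max_weight = 2.5 / n_clocks
print(f"{max_weight:.2%}")    # prints 0.71%
```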
Other than positive feedback, another source of potential instability is the abruptness
with which new clock weights are modified every month. Because different clocks ‘tick’ at
slightly different rates, a sudden change in weights results in a sudden change of frequency.
To avoid frequency jumps, the BIPM adds ‘cushion’ terms to the weighted average based on
a prediction of that month’s jump74. A third precautionary measure taken by the BIPM
assigns a zero weight to new clocks for a four-month test interval before authorizing them to
exert influence on international time.
The results of averaging depend not only on the choice of clock manufacturer,
transmission method and averaging algorithm, but also on the selection of particular
participating clocks. Only laboratories in nations among the eighty members and associates
of BIPM are eligible for participation in the determination of EAL. Funded by membership
fees, the BIPM aims to balance the threshold requirements of metrological quality with the
financial benefits of inclusiveness. Membership requires national diplomatic relations with
France, the depositary of the intergovernmental treaty known as the Metre Convention
(Convention du Mètre). This treaty authorizes BIPM to standardize industrial and scientific
73 Since 2002, the maximal weight of each clock is limited to 2.5 / N, where N is the number of contributing clocks (Petit 2004, 308).
74 Audoin and Guinot 2001, 243-5.
measurement. The BIPM encourages participation in the Metre Convention by highlighting
the advantages of recognized metrological competence in the domain of global trade, and by
offering reduced fees to smaller states and developing countries75. Economic trends and
political considerations thus influence which countries contribute to world time, and
indirectly which atomic clocks are included in the calculation of UTC.
3.2.4. Divergent standards
Despite the multiple means employed to stabilize the weighted average of clock
readings, additional steps are necessary to guarantee stability, due to the fact that the
frequencies of continuously operating clocks tend to drift away from those of primary
standards. In the late 1950s, when atomic time scales were first calculated, they were based
solely on free-running clocks. Over the course of the following two decades, technological
advances revealed that universal time was running too fast: the primary standards that
realized the second were beating slightly slower than the clocks that kept time. To align the
two frequencies, in 1977 the second of UTC was artificially lengthened by one part in 10^13.
At this time it was decided that the BIPM would make regular small corrections that would
‘steer’ the atomic second toward its officially realized duration, in an attempt to avoid future
shocks76. This decision effectively split atomic time into two separate scales, each ‘ticking’
with a slightly different second: on the one hand, the weighted average of free-running
75 Quinn (2003)
76 Audoin and Guinot 2001, 250
clocks (EAL), and on the other the continually corrected (or ‘steered’) International Atomic
Time, TAI (Temps Atomique International).
The monthly calculation of steering corrections is a remarkable algorithmic feat,
relying upon intermittent calibrations against the world’s ten cesium fountains. These
calibrations differ significantly from one another in quality and duration. Some primary
standards run for longer periods than others, resulting in a better signal; some calibrations
suffer from higher transfer noise; and some of the primary standards involved are more
accurate than others77. For this reason the BIPM assigns weights, or ‘filters’, to each
calibration episode depending on its quality. These checks are still not sufficient. Primary
standards do not agree with one another completely, giving rise to the concern that the
duration of the UTC second could fluctuate depending on which primary standard
contributed the latest calibration. To circumvent this, the steering algorithm is endowed with
‘memory’, i.e. it extrapolates data from past calibration episodes into times in which primary
standards are offline. This extrapolation must itself be time-dependent, as noise limits the
capacity of free-running clocks to ‘remember’ the frequency to which they were calibrated.
The BIPM therefore constructs statistical models for the relevant noise factors and uses
them to derive a temporal coefficient, which is then incorporated into the calculation of
‘filters’78.
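The role of ‘memory’ in the steering calculation can be illustrated with a toy filter. The exponential decay used here is my assumption, chosen only to show the shape of the idea; the BIPM derives its temporal coefficient from detailed statistical noise models, as the text explains.

```python
import math

# Toy 'memory' filter (my illustration, not the BIPM algorithm): past
# calibration episodes are combined with weights that decay over time,
# because noise erodes the ensemble's ability to 'remember' the
# frequency to which it was calibrated.

def filtered_frequency(calibrations, now, tau):
    """calibrations: (time, measured fractional frequency offset) pairs.
    tau: assumed decay constant standing in for the noise-model-derived
    temporal coefficient mentioned in the text."""
    num = den = 0.0
    for t, f in calibrations:
        weight = math.exp(-(now - t) / tau)  # older episodes count less
        num += weight * f
        den += weight
    return num / den

# Two hypothetical calibrations; the recent one dominates the estimate.
est = filtered_frequency([(0.0, 1.0e-15), (90.0, 3.0e-15)], 100.0, 30.0)
```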
This steering algorithm allows metrologists to track the difference in frequency
between free-running clocks and primary standards. Ideally, the difference in frequency
would remain stable, i.e. there would be a constant ratio between the ‘seconds’ of the two
77 See Chapter 1 for a detailed discussion of how the accuracy of primary standards is evaluated.
78 Azoubib et al (1977), Arias and Petit (2005)
measures. In this ideal case, requirements for both accuracy and stability would be fulfilled,
and a simple linear transformation of EAL would provide metrologists with a continuous
timescale as accurate as a cesium fountain. In practice, however, EAL continues to drift. Its
second has lengthened in the past decade by a yearly average of 4 parts in 10^16 relative to
primary standards79. This presents metrologists with a twofold problem: first, they have to
decide how fast they want to ‘steer’ world time away from the drifting average. Overly
aggressive steering would destabilize UTC, while too small a correction would cause clocks
the world over to slowly diverge from the official (primary) second. Indeed, the BIPM has
made several modifications to its steering policy in the past three decades in an attempt to
optimize both smoothness and accuracy80. The second aspect of the problem is the need to
stabilize the frequency of EAL. One solution to this aspect of the problem is to replace
clocks in the ensemble with others that ‘drift’ to a lesser extent. This task has largely been
accomplished in the past two decades with the proliferation of HP clocks, but some
instability remains. Elimination or reduction of the remaining instability is likely to require
new algorithmic ‘tricks’. The BIPM is currently considering a change to the EAL weighting
method that would involve a more sophisticated prediction of the behaviour of clocks, a
change that is expected to further reduce frequency drifts81.
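To give the quoted drift figure some intuitive scale, here is back-of-the-envelope arithmetic, treating the drift as if it were a constant fractional frequency offset sustained for one year (a simplification of the gradually lengthening second described above):

```python
# A constant fractional frequency offset y accumulates a time error of
# y * T over an interval T. At 4 parts in 10^16 sustained for a year,
# the uncorrected error would be on the order of a dozen nanoseconds.
SECONDS_PER_YEAR = 365.25 * 86400
y = 4e-16
error_ns = y * SECONDS_PER_YEAR * 1e9   # convert seconds to nanoseconds
print(f"{error_ns:.1f} ns")             # prints 12.6 ns
```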
Disagreements among standards are not the sole condition requiring frequency
steering. Abrupt changes in the ‘official’ duration of the second as realized by primary
standards may also trigger steering corrections. These abrupt changes can occur when
metrologists modify the way in which they model their instruments. For example, in 1996
79 Panfilo and Arias (2009)
80 Audoin and Guinot 2001, 251
81 Panfilo and Arias (2009)
the metrological community achieved consensus around the effects of thermal background
radiation on cesium fountains, previously a much debated topic. A new systematic
correction was subsequently applied to primary standards that shortened the second by
approximately 2 parts in 10^14. While this difference may seem minute, it took more than a
year of monthly steering corrections for UTC to ‘catch up’ with the suddenly shortened
second82.
3.2.5. The leap second
With the calculation of TAI the task of realizing the definition of the standard second
is complete. TAI is considered to be a realization of terrestrial time, that is, an
approximation of general-relativistic coordinate time on the earth’s sea level. However, a
third and last step is required to keep UTC in step with traditional time as measured by the
duration of the solar day. The mean solar day is slowly increasing in duration relative to
atomic time due to gravitational interaction between the earth and the moon. To keep ‘noon
UTC’ closely aligned with the apparent passage of the sun over the Greenwich meridian, a
leap second is occasionally added to UTC based on astronomical observations. By contrast,
TAI remains free of the constraint to match astronomical phenomena, and runs ahead of
UTC by an integer number of seconds83.
82 Audoin and Guinot 2001, 251
83 In January 2009 the difference between TAI and UTC was 34 seconds.
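The TAI–UTC relation is simple enough to state as a one-line conversion. The function below is my illustration; the 34-second figure comes from footnote 83 and changes whenever a new leap second is announced.

```python
# Minimal sketch: TAI and UTC differ by an integer number of leap
# seconds (34 s as of January 2009, per footnote 83), TAI being ahead.

def tai_to_utc(tai_seconds, leap_offset=34):
    """Convert a TAI reading to UTC for a given leap-second offset."""
    return tai_seconds - leap_offset

utc = tai_to_utc(1_000_000_034.0)  # returns 1000000000.0
```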
3.3. The two faces of stability
3.3.1. An explanatory challenge
The global synchronization of clocks in accordance with atomic time is a remarkable
technological feat. Coordinated Universal Time is disseminated to all corners of civil life,
from commerce and aviation to telecommunication, in a manner that is seamless to the vast
majority of its users. This achievement is better appreciated when one contrasts it to the
state of time coordination less than a century-and-a-half ago, when the transmission of time
signals by telegraphic cables first became available. Peter Galison (2003) provides a detailed
history of the efforts involved in extending a unified ‘geography of simultaneity’ across the
globe during the 1870s and 1880s, when railroad companies, national observatories, and
municipalities kept separate and conflicting timescales. Today, the magnitude of
discrepancies among timekeeping standards is far smaller than the accuracy required by
almost all practical applications, with the exception of a few highly precise astronomical
measurements.
The task of the remainder of this chapter is to explain how metrologists succeed in
synchronizing clocks worldwide to Coordinated Universal Time. What are the sources of
this measure’s efficacy in maintaining global consensus among time centers? An adequate
answer must account for the way in which the various ingredients that make up UTC
contribute to its success. In particular, the function of ad hoc corrections, rules of thumb and
seemingly circular inferences prevalent in the production of UTC requires explanation. What
role do these mechanisms play in stabilizing UTC, and is their use justified from an
epistemic point of view? The importance of this question extends beyond the measurement
of time. Answering it will require an account of the goals of standardization projects, the
sort of knowledge such projects produce, and the reasons they succeed or fail. I will begin by
considering two such accounts, namely conventionalism and constructivism, and argue that
they provide only partial and unsatisfactory explanations for the stability of contemporary
timekeeping standards. I will follow this by combining elements of both accounts in the
development of a third, model-based account of standardization that overcomes the
explanatory limitations of the first two.
3.3.2. Conventionalist explanations
Any plausible account of metrological knowledge must attend to the fact that
metrologists enjoy some freedom in determining the correct application of the concepts they
standardize. In order to properly understand the goals of standardization projects one must
first clarify the sources and scope of this freedom. Traditionally, philosophers of science
have taken standardization to consist in arbitrary acts of definition. Conventionalists like
Poincaré and Reichenbach stressed the arbitrary nature of the choice of congruence
conditions, that is, the conditions under which magnitudes of certain quantities such as
length and duration are deemed equal to one another. In his essay on “The Measure of
Time” ([1898] 1958), Poincaré argued against the existence of a mind-independent criterion
of equality among time intervals. Instead, he claimed that the choice of a standard measure
of time is “the fruit of an unconscious opportunism” that leads scientists to select the
simplest system of laws (ibid, 36). Reichenbach called these arbitrary choices of congruence
conditions ‘coordinative definitions’ because they coordinate between the abstract concepts
employed by a theory and the physical relations represented by these concepts (Reichenbach
1927, 14). In the case of time, the choice of congruence conditions amounts to a
coordinative definition of uniformity in the flow of time. Coordinative definitions are required
because theories by themselves do not specify the application conditions for the concepts
they define. A theory can only link concepts to one another, e.g. postulate that the concept
of uniformity of time is tied to the concept of uniform motion, but it cannot tell us which
real motions or frequencies count as uniform. For this, Reichenbach claimed, a coordinative
definition is needed that would link the abstract concept of uniformity with some concrete
method of time measurement. Prior to such coordinative definition there is no fact of the
matter as to whether or not two given time intervals are equal (ibid, 116).
The standardization of time, according to classical conventionalists, involves a free
choice of a coordinative definition for uniformity. It is worth highlighting three features of
this definitional sort of freedom as conceived by classical conventionalists. First, it is an a
priori freedom in the sense that its exercise is independent of experience. One may choose
any uniformity criterion as long as the consequences of that criterion do not contradict one
another. Second, it is a freedom only in principle and not in practice. For pragmatic reasons,
scientists select uniformity criteria that make their descriptions of nature as simple as
possible. The actual selection of coordinative definition is therefore strongly, if not uniquely,
constrained by the results of empirical procedures. Third, definitional freedom is singular in
the sense that it is completely exhausted by a single act of exercising it. Though a definition
can be replaced by another, each such replacement annuls the previous definition. In this
respect acts of definition are essentially ahistorical.
In the case of contemporary timekeeping, the definition of the second functions as a
coordinative definition of uniformity. Recall that the definition of the second specifies that
the period associated with a particular transition of the cesium atom is constant, namely, that
the cycles of the electromagnetic radiation associated with this transition are equal to each
other in duration. The definition of the second, in other words, fixes not only a unit of time
but also a criterion for the congruence of time intervals. In order to make this uniformity
criterion consistent across different relativistic reference frames, the cesium atom is said to
lie on the earth’s approximate sea level. The resulting coordinate timescale, terrestrial time,
provides a universal definition of uniformity while conveniently allowing earth-bound clocks
to approximate it.
According to conventionalists, once a coordinative definition of uniformity is chosen
the equality or inequality of durations is a matter of empirical fact. As the passage quoted
above from Carnap makes clear, the remaining task for metrologists is only to discover which
clocks ‘tick’ at a more stable rate relative to the chosen definition of uniformity and to
improve those clocks that were found to be less stable. Conventionalists, in other words,
explain the stability of networks of standards in naturalistic terms. A naturalistic explanation
for the stability of a network of standards is one that ultimately appeals to an underlying
natural regularity in the properties or behaviors of those standards. In the case of time
measurement, a conventionalist would claim that standardization is successful because the
operation of atomic clocks relies on an empirical regularity, namely the fact that the
frequency associated with the relevant transition is roughly the same for all cesium-133
atoms. This regularity may be described in ways that are more or less simple depending on
one’s choice of coordinative definition, but the empirical facts underlying it are independent
of human choice. Accordingly, a conventionalist explanation for the success of the
stabilizing mechanisms employed in the calculation of UTC is that these mechanisms make
UTC a reliable indicator of an underlying regularity, namely the constancy of the frequency
associated with different concrete cesium atoms used by different clocks84. Supposedly,
metrologists are successful in synchronizing clocks to UTC because the algorithm that
calculates UTC detects those clocks that ‘tick’ closer to the ideal cesium frequency and
distributes time adjustments accordingly.
The idea that UTC is a reliable indicator of a natural regularity gains credence from the
fact that UTC is gradually ‘steered’ towards the frequency of primary standards. As
previously mentioned, primary frequency standards are rigorously evaluated for uncertainties
and compared to each other in light of these evaluations. The fact that the frequencies of
different primary standards are consistent with each other within uncertainty bounds can be
taken as an indication for the regularity of the cesium frequency. Assuming, as metrologists
do85, that the long-term stability of UTC over years is due mostly to ‘steering’, one can
plausibly make the case that the algorithm that produces UTC is a reliable detector of a
natural regularity in the behavior of cesium atoms.
This nevertheless leaves unexplained the success of the mechanisms that keep UTC
stable in the short-term, i.e. when UTC is averaged over weeks and months. These
mechanisms include, among others, the ongoing redistribution of clock weights, the limiting
of maximum weight, the ‘slicing’ of steering corrections into small monthly increments and
the increasingly exclusive reliance on Hewlett-Packard clocks.
One way of accounting for these short-term stabilizing mechanisms is to treat them as
tools for facilitating consensus among metrological institutions. I will discuss this approach
84 This is a slight over-simplification, because not all the clocks that contribute to UTC are cesium clocks. As mentioned, some are hydrogen masers. The ‘regularity’ in question can therefore be taken more generally to be the constancy of frequency associated with any given atomic transition in some predefined set.
85 Audoin and Guinot 2001, 251
in the next subsection. Another option would be to look for a genuine epistemic function
that these mechanisms serve. To a conventionalist (as to any other naturalist), this means
finding a way of vindicating these self-stabilizing mechanisms as reliable indicators of an
underlying natural regularity. Because a reliable indicator is one that is sensitive to the
property being indicated, one should expect the relevant stabilizing mechanisms to do less
well when such regularity is not strongly supported by the data. In practice, however, no such
degradation in stability occurs. On the contrary, short-term stabilization mechanisms are
designed to be as insensitive to frequency drifts or gaps in the data as is practically possible.
It is rather the data that is continually adjusted to stabilize the outcome of the calculation. As
already mentioned, whenever a discrepancy among the frequencies of different secondary
standards persists for too long it is eliminated ad hoc, either by ignoring individual clocks or
by eventually replacing them with others that are more favorable to the stability of the
average. Frequency ‘shocks’ introduced by new clocks are numerically cushioned. Even
corrections towards primary standards, which are supposed to increase accuracy, are spread
over a long period by slicing them into incremental steering adjustments or by embedding
them in a ‘memory-based’ calculation.
The constancy of the cesium period in the short-term is therefore not tested by the
algorithm that produces UTC. For a test implies the possibility of failure, whereas the
stabilizing mechanisms employed by the BIPM in the short-term are fail-safe and intended
to guard UTC against instabilities in the data. Indeed, there is no sign that metrologists even
attempt to test the ‘goodness of fit’ of UTC to the individual data points that serve as the
input for the calculation, let alone that they are prepared to reject UTC if it does not fit the
data well enough. Rather than a hypothesis to be tested, the stability of the cesium period is
a presupposition that is written into the calculation from the beginning and imposed on the
data that serves as its input. This seemingly question-begging practice of data analysis
suggests either that metrological methods are fundamentally flawed or that the
conventionalist explanation overlooks some important aspect of the way UTC is supposed
to function. In Section 3.4 I will argue that the latter is the case, and that the seeming
circularity in the calculation of UTC dissolves once the normative role of models in
metrology is acknowledged.
3.3.3. Constructivist explanations
As we learned previously, UTC owes its short-term stability not to the detection of
regularities in underlying clock data, but rather to the imposition of a preconceived regularity
on that data. This regularity, i.e. the frequency stability of participating clocks relative to
UTC, is imposed on the data through weighting adjustments, time steps and frequency
corrections implemented in the various stages of calculation. Constructivist explanations for
the success of standardization projects make such regulatory practices their central
explanans. According to Latour and Schaffer (quoted above), the stability of global
timekeeping is explained by the ongoing efforts of metrological institutions to harness clocks
into synchronicity. Particularly, standard clocks agree about the time because metrologists
maintain a stable consensus as to which clocks to use and how the readings of these clocks
should be corrected. The stability of consensus is in turn explained by international
bureaucratic cooperation among standardization institutes. To use Latour’s language, the
stability of the network of clocks depends on an ongoing flux of paper forms issued by a
network of calculation centers. When we look for the sources of regularity by which these
forms are circulated we do not find universal laws of nature but international treaties, trade
agreements and protocols of meetings among clock manufacturers, theoretical physicists,
astronomers and communication engineers. Without the efforts and resources continuously
poured into the metrological enterprise, atomic clocks would not be able to tell the same
time for very long.
From a constructivist perspective, the algorithm that produces UTC is a particularly
efficient mechanism for generating consensus among metrologists. Recall that Coordinated
Universal Time is nothing over and above a list of corrections that the BIPM prescribes to
the time signals maintained by local standardization institutes. By administering the
corrections published in the monthly reports of the BIPM, metrologists from different
countries are able to reach agreement despite the fact that their clocks ‘tick’ at different rates.
This agreement is not arbitrary but constrained by the need to balance the central authority
of the International Bureau with the autonomy of national institutes. The need for a trade-
off between centralism and autonomy accounts for the complexity of the algorithm that
produces UTC, which is carefully crafted to achieve a socially optimal compromise among
metrologists. A socially optimal compromise is one that achieves consensus with minimal
cost to local metrological authorities, making it worthwhile for them to comply with the
regulatory strictures imposed by the BIPM. Indeed, the algorithm is designed to distribute
the smallest adjustments possible among as many clocks as possible. Consequently, the
overall adjustments required to approximate UTC at any given local laboratory are kept to a
minimum.
In stressing the importance of ongoing negotiations among metrological institutions,
constructivists do not yet diverge from conventionalists, who similarly view the comparison
and adjustment of standards as prerequisites for the reproducibility of measurement results.
But constructivists go a step further and, unlike conventionalists, refuse to invoke the
presence of an underlying natural regularity in order to explain the stability of timekeeping
standards86. On the contrary, they remind us that regularity is imposed on otherwise
discrepant clocks for the sake of achieving commercial and economic goals. Only after the
fact does this socially-imposed regularity assume the appearance of a natural phenomenon.
Latour expresses this view by saying that “[t]ime is not universal; every day it is made slightly
more so by the extension of an international network [of standards]” (1987, 251). Schaffer
similarly claims that facts only “work” outside of the laboratory because metrologists have
already made the world outside of the laboratory “fit for science” (1992, 23). According to
these statements, if they are taken literally, quantitative scientific claims attain universal
validity not by virtue of any preexisting state of the world, but by virtue of the continued
efforts of metrologists who transform parts of the world until they reproduce desired
quantitative relations87. In what follows I will call this the reification thesis.
The reification thesis is a claim about the sources of regularity exhibited by
measurement outcomes outside of the carefully controlled conditions of a scientific
laboratory. This sort of regularity, constructivists hold, is constituted by the stabilizing
practices carried out by metrologists rather than simply discovered in the course of carrying
out such practices. In other words, metrologists do not simply detect those instruments and
methods that issue reproducible outcomes; rather, they enforce a preconceived order on
otherwise irregular instruments and methods until they issue sufficiently reproducible
86 Ian Hacking identifies explanations of stability as one of three ‘sticking points’ in the debate between social constructivists and their intellectual opponents (1999, 84-92).
87 These claims echo Thomas Kuhn’s in his essay “The Function of Measurement in Modern Physical Science” ([1961] 1977).
outcomes. Note that the reification thesis entails an inversion of explanans and
explanandum relative to the conventionalist account. It is the successful stabilization of
metrological networks that, according to Latour and Schaffer, explains universal regularities
in the operation of instruments rather than the other way around.
How plausible is this explanatory inversion in the case of contemporary timekeeping?
As already hinted at above, the constructivist account fits well with the details of the case
insofar as the short-term stability of standards is involved. In the short run, the UTC
algorithm does not detect frequency stability in the behavior of secondary standards but
imposes stability on their behavior. Whenever a discrepancy arises among different clocks it
is eliminated by ad hoc correction or by replacing some of the clocks with others. The ad hoc
nature of these adjustments guarantees that any instability, no matter how large, can be
eliminated in the short run simply by redistributing instruments and ‘paper forms’
throughout the metrological network.
The constructivist account is nevertheless hard pressed to explain the fact that the
corrections involved in maintaining networks of standards remain small in the long run. An
integral part of what makes a network of metrological standards stable is the fact that its
maintenance requires only small and occasional adjustments rather than large and frequent ones.
A network that reverted to irregularity too quickly after its last recalibration would demand
constant tweaking, making its maintenance ineffective. This long-term aspect of stability is
an essential part of what constitutes a successful network of standards, and is therefore in
need of explanation no less than its short-term counterpart. After all, nothing guarantees
that metrologists will always succeed in diminishing the magnitude and frequency of
corrections they apply to networks of instruments. How should one explain their success,
then, in those cases where they do succeed? Recall that the conventionalist appealed to
underlying regularities in nature to explain long-term stability: metrologists succeed in
stabilizing networks because they choose naturally stable instruments. But this explanatory
move is blocked for those who, like Latour and Schaffer, hold to the reification thesis with
its requirement of explanatory inversion.
To illustrate this point, imagine that metrologists decided to keep the same algorithm
they currently use for calculating UTC, but implemented it on the human heart as a standard
clock instead of the atomic standard88. As different hearts beat at different rates depending
on the particular person and circumstances, the time difference between these organic
standards would grow rapidly from the time of their latest correction. Institutionally
imposed adjustments would only be able to bring universal time into agreement for a short
while before discrepancies among different heart-clocks exploded once more. The same
algorithm that produces UTC would be able to minimize adjustments to a few hours per
month at best, instead of a few nanoseconds when implemented with atomic standards. In
the long run, then, the same mechanism of social compromise would generate either a highly
stable, or a highly unstable, network depending on nothing but the kind of physical process
used as a standard. Constructivists who work under the assumption of the reification thesis
cannot appeal to natural regularities in the behavior of hearts or cesium atoms as primitive
explanans, and would therefore be unable to explain the difference in stability.
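The contrast in the thought experiment above can be put in rough numbers. The sketch below is deliberately crude, and its rate dispersions are hypothetical placeholders (fractional agreement of order 10^-14 for cesium-like clocks, tens of percent for heart rates), chosen only to show how the same comparison yields wildly different stabilities depending on the underlying physical process.

```python
def max_divergence(rates, duration):
    """Worst-case accumulated time difference (in seconds) between any two
    clocks ticking at constant fractional rates over a given duration."""
    elapsed = [r * duration for r in rates]
    return max(elapsed) - min(elapsed)

month = 30 * 86400  # seconds in a month

# Hypothetical rate dispersions, for illustration only.
cesium_like = [1.0, 1.0 + 1e-14, 1.0 - 2e-14]  # parts in 10^14
heart_like = [1.0, 1.2, 0.8]                   # tens of percent

# Over a month, the cesium-like ensemble drifts apart by roughly tens of
# nanoseconds, while the heart-like ensemble drifts apart by days.
```

Whatever corrections an algorithm distributes, the residual divergence it must absorb differs here by many orders of magnitude, and that difference is fixed by the processes themselves, not by the bookkeeping.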
Constructivists may respond by claiming that, for contingent historical reasons,
metrologists have not (yet) mastered reliable control over human hearts as they have over
cesium atoms. This is a historical fact about humans, not about hearts or cesium atoms.
However, even if this claim is granted, it offers no explanation for the difference in long-
88 A similar imaginary exercise is proposed by Carnap ([1966] 1995), pp. 80-84.
term stability but only admits the lack of such an explanation. Another possibility is for
constructivists to relax the reification thesis, and claim that metrologists do detect
preexisting regularities in the behavior of their instruments, but that such regularities do not
sufficiently explain how networks of standards are stabilized. Under this ‘moderate’ reification
thesis, constructivists admit that a combination of natural and socio-technological explanans
is required for the stability of metrological networks. The question then arises as to how the
two sorts of explanans should be combined into a single explanatory account. The following
section will provide such an account.
3.4. Models and coordination
3.4.1. A third alternative
As we have seen, conventionalists and constructivists agree that claims concerning
frequency stability are neither true nor false independent of human agency, but disagree
about the scope and limits of this agency. Conventionalists believe that human agency is
limited to an a priori freedom to define standards of uniformity. For example, the statement:
‘under specified conditions, the cesium transition frequency is constant’ is a definition of
frequency constancy. Once a choice of definition is made, stabilization is a matter of
discovering which clocks agree more closely with the chosen definition and improving those
clocks that do not agree closely enough. Hence the claim: ‘during period T1…T2, clock X
ticked at a constant frequency relative to the current definition of uniformity’ is understood
as an empirical claim whose truth or falsity cannot be modified by metrologists.
Constructivists argue instead that judgments about frequency stability cannot be
abstracted away from the concrete context in which they are made. Claims to frequency
stability are true or false only relative to a particular act of comparison among clocks, made
at a particular time and location in an ever changing network of instruments, protocols and
calculations. As evidenced in detail above, the metrological network of timekeeping
standards is continually rebalanced in light of considerations that have little or nothing to do
with the theoretical definition of uniformity. Quite apart from any ideal definition, de facto
notions of uniformity are multiple and in flux, being constantly modified through the actions
of standardization institutions. If claims to frequency stability appear universal and context-
free, it is only because they rely on metrological networks that have already been successfully
stabilized and ‘black-boxed’ so as to conceal their historicity.
In an attempt to reconcile the two views, one may be tempted to simply juxtapose
their explanans. One would adopt a conventionalist viewpoint to explain the long-term
stability of networks of standards and a constructivist viewpoint to explain short-term
stability. But such juxtaposition would be incoherent, because the two viewpoints make
contradictory claims. As already mentioned, constructivists like Latour and Schaffer reject
the very idea of pre-existing natural regularity, an idea that lies at the heart of
conventionalist explanations of stability. Any attempt to use elements of both views
without reconciling their fundamental tension can only provide an illusion of explanation.
The philosophical challenge, then, is to clarify exactly how constructivism can be ‘naturalized’
and conventionalism ‘socialized’ in a manner that explains both long- and short-term
stability. Meeting this challenge requires developing a subtler notion of natural regularity
than either view offers.
The model-based account of standardization that I will now propose does exactly that.
It borrows elements from both conventionalism and constructivism while modifying their
assumptions about the sources of regularity in both nature and society. As I will argue, this
account successfully explains both the long- and short-term stability of metrological
networks without involving contradictory suppositions.
The model-based account may be summarized by the following four claims:
(i) The proper way to apply a theoretical concept (e.g. the concept of uniformity
of time) depends not only on its definition but also on the way concrete
instruments are modeled in terms of that concept both theoretically and statistically;
(ii) Metrologists are to some extent free to influence the proper mode of
application of the concepts they standardize, not only through acts of
definition, but also by adjusting networks of instruments and by modifying
their models of these instruments;
(iii) Metrologists exercise this freedom by continually shifting the proper mode of
application of the concepts they standardize so as to maximize the stability of
their networks of standards;
(iv) In the process of maximizing stability, metrologists discover and exploit
empirical regularities in the behavior of their instruments.
In what follows I shall argue for each of these four claims and illustrate them in the
special case of contemporary timekeeping. In so doing I will show that the model-based
approach does a better job than the previous two alternatives at explaining the stability of
metrological standards.
3.4.2. Mediation, legislation, and models
The central goal of standardizing a theoretical concept, according to the model-based
approach, is to regulate the application of the concept to concrete particulars. A
standardization project is successful when the application of the concept is universally
consistent and independent of factors that are deemed local or irrelevant. In conventionalist
jargon, standardization projects ‘coordinate’ a theoretical concept to exemplary particulars.
But in the model-based approach such coordination is not exhausted by arbitrary acts of
definition. If coordination amounted to a kind of stipulative act as Reichenbach believed, the
correct way to apply theoretical concepts to concrete particulars would be completely
determinate once this stipulation is given. This is clearly not the case. Consider the
application of the concept of terrestrial time to a concrete cesium clock: the former is a
highly abstract concept, namely the timescale defined by the ‘ticks’ of a perfectly accurate
cesium clock on the ideal surface of the rotating geoid; the latter is a machine exhibiting a
myriad of imperfections relative to the theoretical ideal. How is one to apply the notion of
terrestrial time to the concrete clock, namely, decide how closely the concrete clock ‘ticks’
relative to the ideal terrestrial timescale? The definition of terrestrial time offers a useful
starting point, but on its own is far too abstract to specify a method for evaluating the
accuracy of any clock. Considerable detail concerning the design and environment of the
concrete clock must be added to the definition before the abstract concept can be
determinately applied to evaluate the accuracy of that clock89.
This adding of detail amounts, in effect, to the construction of a hierarchy of models of
concrete clocks at differing levels of abstraction. At the highest level of this hierarchy we find
the theoretical model of an unperturbed cesium atom on the geoid. As mentioned, this
model defines the notion of terrestrial time, the theoretical timescale that is realized by
Coordinated Universal Time.
At the very bottom of this hierarchy lie the most detailed and specific models
metrologists construct of their apparatus. These models typically represent the various
systematic effects and statistical fluctuations influencing a particular ensemble of atomic
clocks housed in one standardization laboratory. These models are used for the calculation
of local approximations to UTC.
Mediating between these levels is a third model, perhaps more aptly termed a cluster of
theoretical and statistical models, grounding the calculation of UTC itself. The models in this
cluster are abstract and idealized representations of various aspects of the clocks that
contribute to UTC and their environments. Among these models, for example, are several
statistical models of noise (e.g. white noise, flicker noise and Brownian noise) as well as
simplified representations of the properties of individual clocks (weights, ‘filters’) and
properties of the ensemble as a whole (‘cushion’ terms, ‘memory’ terms). Values of the
parameter called ‘Coordinated Universal Time’ are determined by analyzing clock data from
the past month in light of the assumptions of models in this cluster.
89 As I have shown in Chapter 1, the accuracy of measurement standards can only be evaluated once the definition of the concept being standardized is sufficiently de-idealized.
It is to this parameter, ‘Coordinated Universal Time’, that the concept of terrestrial
time is directly coordinated, rather than to any concrete clock90. Like Reichenbach, I am
using the term ‘coordination’ here to denote an act that specifies the mode of application of
an abstract theoretical concept. But the form that coordination takes in the model-based
approach is quite different than what classical conventionalists have envisioned. Instead of
directly linking concepts with objects (or operations), coordination consists in the
specification of a hierarchy among parameters in different models. In our case, the hierarchy
links a parameter (terrestrial time) in a highly abstract and simplified theoretical model of the
earth’s spacetime to a parameter (UTC) in a less abstract, theoretical-statistical cluster of
models of certain atomic clocks. UTC is in turn coordinated to a myriad of parameters
(UTC(k)) representing local approximations of UTC by even more detailed, lower-level
models.
Finally, the particular clocks that standardize terrestrial time are subsumed under the
lowest-level models in the hierarchy. I am using the term ‘subsumed under’ rather than
‘described by’ because the accuracy of a concrete clock is evaluated against the relevant low-
level model and not the other way around. This is an inversion of the usual way of thinking
about approximation relations. In most types of scientific inquiry abstract models are meant
to approximate their concrete target systems. But the models constructed during
standardization projects have a special normative function, that of legislating the mode of
application of concepts to concrete particulars. Indeed, standardization is precisely the
legislation of a proper mode of application for a concept through the specification of a
90 More exactly, the concept of terrestrial time is directly coordinated to TAI, i.e. to UTC prior to the addition of ‘leap seconds’ (see the discussion on ‘leap second’ in Section 3.2.5.)
hierarchy of models. At each level of abstraction, the models specify what counts as an
accurate application of the standardized concept at the level below.
Figure 3.1: A simplified hierarchy of approximations among model parameters in contemporary timekeeping. Vertical position on the diagram denotes level of abstraction and arrows denote
approximation relations. Note that concrete levels approximate abstract ones.
Consequently, the chain of approximations (or ‘realizations’) runs upwards in the
hierarchy rather than downwards: concrete clocks approximate local estimates of UTC,
which in turn approximate UTC as calculated by the International Bureau, which in turn
approximates the ideal timescale known as terrestrial time. Figure 3.1 summarizes the
various levels of abstraction and relations of approximation involved in contemporary
atomic timekeeping.
The inversion of approximation relations explains why metrologists deal with
discrepancies in the short run by adjusting clocks rather than by modifying the algorithm
that calculates UTC. If UTC were an experimental best-fit to clock indications, the practice
of correcting and excluding clocks would be suspected of question-begging. However, the goal
of the calculation is not to approximate clock readings, but to legislate the way in which
those readings should be corrected relative to the concept being standardized, namely
uniform time on the geoid (i.e. terrestrial time). The next subsection will clarify why
metrologists are free to perform such legislation.
3.4.3. Coordinative freedom
Equipped with a more nuanced account of coordination than that offered by
conventionalists, we can now proceed to examine how metrological practices influence the
mode of application of concepts. Conventionalists, recall, took the freedom involved in
coordination to be a priori, in principle and singular. According to the model-based account,
metrologists who standardize concepts enjoy a different sort of freedom, one that is
empirically constrained and practically exercised in an ongoing manner. Specifically,
metrologists are to some extent free to decide not only how they define an ideal
measurement of the quantity they are standardizing, but also what counts as an accurate concrete
approximation (or ‘realization’) of this ideal.
The freedom to choose what counts as an accurate approximation of a theoretical ideal
is special to metrology. It stems from the fact that, in the context of a standardization
project, the distribution of errors among different realizations of the quantity being
standardized is not completely determinate. Until metrologists standardize a quantity-
concept, its mode of application remains partially vague, i.e. some ambiguity surrounds the
proper way of evaluating errors associated with measurements of that quantity. Indeed, in
the absence of such ambiguity standardization projects would be not only unnecessary but
impossible. Nevertheless, ambiguity of this sort cannot be dissolved simply by making more
measurements, as a determinate standard for judging what would count as a measurement
error is the very thing metrologists are trying to establish. This problem of indeterminacy is
illustrated most clearly in the case of systematic error91.
The inherent ambiguity surrounding the distribution of errors in the context of
standardization projects leaves metrologists with some freedom to decide how to distribute
errors among multiple realizations of the same quantity. Consequently, metrologists enjoy
some freedom in deciding how to construct models that specify what counts as an ideal
measurement of the quantity they are standardizing in some local context. Concrete
instruments are then subsumed under these idealized models, and errors are evaluated
relative to the chosen ideal.
Metrologists make use of this freedom to fit the mode of application of the concept to
the goals of the particular standardization project at hand. In some cases such goals may be
‘properly’ cognitive, e.g. the reduction of uncertainty, a goal which dominates choices of
primary frequency realizations. But in general there is no restriction on the sort of goals that
may inform choices of realization, and they may include economic, technological and
political considerations.
91 For a detailed argument to this effect see Chapter 2 of this thesis, “Systematic Error and the Problem of Quantity Individuation.”
The freedom to represent and distribute errors in accordance with local and pragmatic
goals explains why metrologists allow themselves to introduce seemingly self-fulfilling
mechanisms to stabilize UTC. Rather than ask: ‘how well does this clock approximate
terrestrial time?’ metrologists are, to a limited extent, free to ask: ‘which models should we
use to apply the concept of terrestrial time to this clock?’ In answering the second question
metrologists enjoy some interpretive leeway, which they use to maximize the short-term
stability of their clock ensemble. This is precisely the role of the algorithmic mechanisms
discussed above. These self-stabilizing mechanisms do not require justification for their
ability to approximate terrestrial time because they are legislative with respect to the
application of the concept of terrestrial time to begin with. UTC is successfully stabilized in
the short run not because its calculation correctly applies the concept of terrestrial time to
secondary standards; rather, UTC is chosen to determine what counts as a correct
application of the concept of terrestrial time to secondary standards because this choice
results in greater short-term stability. Contrary to conventionalist explanations of stability,
then, the short-term stability of UTC cannot be fully explained by the presence of an
independently detectable regularity in the data from individual clocks. Instead, a complete
explanation must appeal irreducibly to stabilizing policies adopted by metrological
institutions. These policies are designed in part to promote a socially optimal compromise
among those institutions.
Coordination is nonetheless not arbitrary. The sort of freedom metrologists exercise in
standardizing quantity concepts is quite different than the sort of freedom typically
associated with arbitrary definition. As the recurring qualification ‘to some extent’ in the
discussion above hints, the freedom exercised by metrologists in practice is severely, though
not completely, constrained by empirical considerations. First, the quantity concepts being
standardized are not ‘free-floating’ concepts but are already embedded in a web of
assumptions. Terrestrial time, for example, is a notion that is already deeply saturated with
assumptions from general relativity, atomic theory, electromagnetic theory and quantum
mechanics. The task of standardizing terrestrial time in a consistent manner is therefore
constrained by the need to maintain compatibility with established standards for other
quantities that feature in these theories. Second, terrestrial time may be approximated in
more than one way. The question ‘how well does clock X approximate terrestrial time?’ is
therefore still largely an empirical question even in the context of a standardization project. It
can be answered to a good degree of accuracy by comparing the outcomes of clock X with
other approximations of terrestrial time. Such approximations rely on post-processed data
from primary cesium standards or on astronomical time measurements derived from the
observation of pulsars. But these approximations of terrestrial time do not completely agree
with one another. More generally, different applications of the same concept to different
domains, or in light of a different trade-off between goals, often end up being somewhat
discrepant in their results. Standardization institutes continually manage a delicate balance
between the extent of legislative freedom they allow themselves in applying concepts and the
inevitable gaps discovered among multiple applications of the same concept. Nothing
exemplifies better the shifting attitudes of the BIPM towards this trade-off than the history
of ‘steering’ corrections, which have been dispensed aggressively or smoothly over the past
decades depending on whether accuracy or stability was preferred.
The gaps discovered between different applications of the same quantity-concept are
among the most important (though by no means the only) pieces of empirical knowledge
amassed by standardization projects. Such gaps constitute empirical discoveries concerning the
existence or absence of regularities in the behavior of instruments, and not merely about the
way metrologists use their concepts. This is a crucial point, as failing to appreciate it risks
mistaking standardization projects for exercises in the social regulation of data-analysis
practices. Even if metrologists reached perfect consensus as to how they apply a given
quantity concept, there is no guarantee that the application they have chosen will lead to
consistent results. Success and failure in applying a quantity concept consistently are to be
investigated empirically, and the discovery of gaps (or their absence) is accordingly a matter
of obtaining genuine empirical knowledge about regularities in nature.
The discovery of gaps explains the possibility of stabilizing networks of standards in
the long run. Metrologists choose to use as standards those instruments to which they have
managed to apply the relevant concept most consistently, i.e. with the smallest gaps. To
return to the example above, metrologists have succeeded in applying the concept of
temporal uniformity to different cesium atoms with much smaller gaps than to different
heart rates. This is not only a fact about the way metrologists apply the concept of
uniformity, but also about a natural regularity in the behavior of cesium atoms, a regularity
that is discovered when cesium clocks are subsumed under the concept of uniformity
through the mediation of relevant models. Metrologists rely on such regularities for their
choices of physical standards, i.e. they tend to select those instruments whose behavior
requires the smallest and least frequent ad hoc corrections. Moreover, as standardization
projects progress, metrologists often find new theoretical and statistical means of predicting
some of the gaps that remain, thereby discovering ever ‘tighter’ regularities in the behavior
of their instruments.
The notion of empirical regularity employed by the model-based account differs from
the empiricist one adopted by classical conventionalists. Conventionalists equated regularity
with a repeatable relation among observations. Carnap, for example, identified regularity in
the behavior of pendulums with the constancy of the ratio between the number of swings
they produce ([1966] 1995, 82). This naive empiricist notion of regularity pertains to the
indications of instruments. By contrast, my notion of regularity pertains to measurement
outcomes, i.e. to estimates that have already been corrected in light of theoretical and statistical
assumptions92. The behavior of measuring instruments is deemed regular relative to some set
of modeling assumptions insofar as their outcomes are predictable under those assumptions.
Prior to the specification of modeling assumptions there can be no talk of regularities,
because such assumptions are necessary for forming expectations about which configuration
of indications would count as regular. Hence modeling assumptions are strongly constitutive
of empirical regularities in my sense of the term. At the same time, regularities are still
empirical, as their existence depends on which indications instruments actually produce.
Empirical regularities, in other words, are co-produced by observations as well as the
assumptions with which a scientific community interprets those observations.
This Kantian-flavored, dual-source conception of regularity explains the possibility of
legislating to nature the conditions under which time intervals are deemed equal. Recall that
acts of legislation determine not only how concepts are applied, but also which
configurations of observations count as regular. For example, which clocks ‘tick’ closer to
the natural frequency of the cesium transition depends on which rules metrologists choose
to follow in applying the concept of natural uniformity93. This is not meant to deny that
there may be mind-independent facts about the frequency stability of clocks, but merely to
92 My analysis of the notion of empirical regularity is therefore similar to my analysis of the notion of agreement discussed in Chapter 2.
93 Kant would have disagreed with this last statement, as he took time to be a universal form of intuition and the synthesis of temporal relations to be governed by universal schemata regardless of one’s theoretical suppositions. The inspiration I draw from Kant does not imply a wholesale adoption of his philosophy.
acknowledge that such mind-independent facts, if they exist, play no role in grounding
knowledge claims about frequency stability. Indeed, the standardization of terrestrial time
would be impossible were metrologists required to obtain such facts, which pertain to ideal
and experimentally inaccessible conditions. From the point of view of the model-based
account, by contrast, there is nothing problematic about this inaccessibility, as the
application of a concept does not require satisfying its theoretical definition verbatim.
Instead, metrologists have a limited but genuine authority to legislate empirical regularities to
their observations, and hence to decide which approximations of the definition are closer
than others, despite not having experimental access to the theoretical ideal.
3.5. Conclusions
This chapter has argued that the stability of the worldwide consensus around
Coordinated Universal Time cannot be fully explained by reduction to either the natural
regularity of atomic clocks or the consensus-building policies enforced by standardization
institutes. Instead, both sorts of explanantia dovetail through an ongoing modeling activity
performed by metrologists. Standardization projects involve an iterative exchange between
‘top-down’ adjustment to the mode of application of concepts and ‘bottom-up’ discovery of
inconsistencies in light of this application94.
94 This double-sided methodological configuration is an example of Hasok Chang’s (2004, 224-8) ‘epistemic iterations.’ It is also reminiscent of Andrew Pickering’s (1995, 22) patterns of ‘resistance and accommodation’, with the important difference that Pickering does not seem to ascribe his ‘resistances’ to underlying natural regularities.
This bidirectional exchange results in greater stability as it allows metrologists to latch
onto underlying regularities in the behavior of their instruments while redistributing errors in
a socially optimal manner. When modeling the behavior of their clocks, metrologists are to
some extent free to decide which behaviors count as naturally regular, a freedom which they use
to maximize the efficiency of a social compromise among standardizing institutions. The
need for effective social compromise is therefore one of the factors that determine the
empirical content of the concept of a uniformly ‘ticking’ clock. On the other hand, the need
for consistent application of this concept is one of the factors that determine which social
compromise is most effective. The model-based account therefore combines the
conventionalist claim that congruity is a description-relative notion with the constructivist
emphases on the local, material and historical contexts of scientific knowledge.
4. Calibration: Modeling the Measurement Process
Abstract: I argue that calibration is a special sort of modeling activity, namely the activity of constructing, testing and deriving predictions from theoretical and statistical models of a measurement process. Measurement uncertainty is accordingly a special sort of predictive uncertainty, namely the uncertainty involved in predicting the outcomes of a measurement process based on such models. I clarify how calibration establishes the accuracy of measurement outcomes and the role played by measurement standards in this procedure. Contrary to currently held views, I show that establishing a correlation between instrument indications and standard quantity values is neither necessary nor sufficient for successful calibration.
4.1. Introduction
A central part of measuring is evaluating accuracy. A measurement outcome that is not
accompanied by an estimate of accuracy is uninformative and hence useless. Even when a
value range or standard uncertainty is not explicitly reported with a measurement outcome,
a rough accuracy estimate is implied by the practice of recording only ‘meaningful digits’. And
yet the requirement to evaluate accuracy gives rise to an epistemological conundrum, which I
have called ‘the problem of accuracy’ in the introduction to this thesis. The problem arises
because the exact values of most physical quantities are unknowable. Quantities such as
length, duration and temperature, insofar as they are represented by non-integer (e.g. rational
or real) numbers, are impossible to measure with certainty. The accuracy of measurements
of such quantities cannot, therefore, be evaluated by reference to exact values but only by
comparing uncertain estimates to each other. When comparing two uncertain estimates of
the same quantity it is impossible to tell exactly how much of the difference between them is
due to the inaccuracy of either estimate. Multiple ways of distributing errors between the two
estimates are consistent with the data. The problem of accuracy, then, is an
underdetermination problem: the available evidence is insufficient for grounding claims
about the accuracy of any measurement outcome in isolation, independently of the
accuracies of other measurements95.
One attempt to solve this problem which I have already discussed is to adopt a
conventionalist approach to accuracy. Mach ([1896] 1966) and later Carnap ([1966] 1995)
and Ellis (1966) thought that the problem of accuracy could be solved by arbitrarily selecting
a measuring procedure as a standard. The accuracies of other measuring procedures are then
evaluated against the standard, which is considered completely accurate. The disadvantages
of the conventionalist approach to accuracy have already been explored at length in the
previous chapters. As I have shown, measurement standards are necessarily inaccurate to
some extent, because the definitions of the quantities they standardize necessarily involve
95 The problem of accuracy can be formulated in other ways, i.e. as a regress or circularity problem rather than an underdetermination problem. In the regress formulation, the accuracy of a set of estimates is established by appealing to the accuracy of yet another estimate, etc. In the circularity formulation, the accuracy of one estimate is established by appealing to the accuracy of a second estimate, whose accuracy is in turn established by appeal to the accuracy of the first. All of these formulations point to the same underlying problem, namely the insufficiency of comparisons among uncertain estimates for determining accuracy. I prefer the underdetermination formulation because it makes it easiest to see why auxiliary assumptions about the measuring process can help solve the problem.
some idealization96. Moreover, the inaccuracies associated with measurement standards are
themselves evaluated by mutual comparisons among standards, a fact that further
accentuates the problem of accuracy.
In Chapter 1 I provided a solution to the problem of accuracy in the special case of
primary measurement standards. I showed that a robustness test performed among the
uncertainties ascribed to multiple standards provides sufficient grounds for making accuracy
claims about those standards. The task of the current chapter is to generalize this solution to
any measuring procedure, and to explain how the methods actually employed in physical
metrology accomplish this solution. Specifically, my aim will be to clarify how the various
activities that fall under the title ‘calibration’ support claims to measurement accuracy.
At first glance this task may appear simple. It is commonly thought that calibration is
the activity of establishing a correlation between the indications of a measuring instrument
and a standard. Marcel Boumans, for example, states that “A measuring instrument is
validated if it has been shown to yield numerical values that correspond to those of some
numerical assignments under certain standard conditions. This is also called calibration
[…].” (2007, 236). I have already shown that there is good reason to think that primary
measurement standards are accurate up to their stated uncertainties. Is it not obvious that
calibration, which establishes a correlation with standard values, thereby also establishes the
accuracy of measuring instruments?
96 Even when the definition of a unit refers to a concrete object such as the Prototype Kilogram, the specification of a standard measuring procedure still involves implicit idealizations, such as the possibility of creating perfect copies of the Prototype and the possibility of constructing perfect balances to compare the mass of the Prototype to those of other objects.
This seemingly straightforward way of thinking about calibration neglects a more
fundamental epistemological challenge, namely the challenge of clarifying the importance of
standards for calibration in the first place. Given that the procedures called ‘standards’ are to
some extent inaccurate, and given that some measuring procedures are more accurate than
the current standard (as shown in Chapter 1), why should one calibrate instruments against
metrological standards rather than against any other sufficiently accurate measuring
procedure?
In what follows I will show that establishing a correlation between instrument
indications and standard values is neither necessary nor sufficient in general for successful
calibration. The ultimate goal of calibration is not to establish a correlation with a standard,
but to accurately predict the outcomes of a measuring procedure. Comparison to a standard is but one
method for generating such predictions, a method that is not always required and is often
inaccurate by itself. Indeed, only in the simplest and most inaccurate case of calibration
(‘black-box’ calibration) is predictability achieved simply by establishing empirical
correlations between instrument indications and standard values. A common source of
misconceptions about calibration is that this simplest form of calibration is mistakenly
thought to be representative of the general case. The opposite is true: ‘black-box’ calibration
is but a special case of a much more complex way of representing measuring instruments
that involves detailed theoretical and statistical considerations.
As I will argue, calibration is a special sort of modeling activity, one in which the
system being modeled is a measurement process. I propose to view calibration as a modeling
activity in the full-blown sense of the term ‘modeling’, i.e. constructing an abstract and
idealized representation of a system from theoretical and statistical assumptions and using
this representation to explain and predict that system’s behaviour97.
I will begin by surveying the products of calibration as explicated in the metrological
literature (Section 4.2) and distinguish between two calibration methodologies that,
following Boumans (2006), I call ‘black-box’ and ‘white-box’ calibration (Sections 4.3 and
4.4). I will show that white-box calibration is the more general of the two, and that it is
aimed at predicting measurement outcomes rather than mapping indications to standard
values. Section 4.5 will then discuss the role of metrological standards in calibration and
clarify the conditions under which their use contributes to the accurate prediction of
measurement outcomes. Finally, Section 4.6 will explain how the accuracy of measurement
outcomes is evaluated on the basis of the model-based predictions produced during
calibration.
4.2. The products of calibration
4.2.1. Metrological definition
The International Vocabulary of Metrology (VIM) defines calibration in the following way:
Calibration: operation that, under specified conditions, in a first step, establishes a relation between the quantity values with measurement uncertainties provided by measurement standards and corresponding indications with associated measurement uncertainties and, in a second step, uses this information to establish a relation for obtaining a measurement result from an indication. (JCGM 2008, 2.39)
97 See also Mari 2005.
This definition is functional, that is, it characterizes calibration through its products.
Two products are mentioned in the definition, one intermediary and one final. The final
product of calibration operations is “a relation for obtaining a measurement result from an
indication”, whereas the intermediary product is “a relation between the quantity values […]
provided by measurement standards and corresponding indications”. Calibration therefore
produces knowledge about certain relations. My aim in this section will be to explicate these
relations and their relata. The following three sections will then provide a methodological
characterization of calibration, namely, a description of several common strategies by which
metrologists establish these relations. In each case I will show that the final product of
calibration – a relation for obtaining a measurement result from an indication – is established
by making model-based predictions about the measurement process. This methodological
characterization will in turn set the stage for the epistemological analysis of calibration in the
last section.
4.2.2. Indications vs. outcomes
The first step in elucidating the products of calibration is to distinguish between
measurement outcomes (or ‘results’) and instrument indications, a distinction previously
discussed in Chapter 2. To recapitulate, an indication is a property of the measuring
instrument in its final state after the measurement process is complete. Examples of
indications are the numerals appearing on the display of a digital clock, the position of an
ammeter pointer relative to a dial, and the pattern of diffraction produced in x-ray
crystallography. Note that the term ‘indication’ in the context of the current discussion
carries no normative connotation. It does not presuppose reliability or success in indicating
anything, but only an intention to use such outputs for reliable indication of some property of
the sample being measured. Note also that indications are not numbers: they may be
symbols, visual patterns, acoustic signals, relative spatial or temporal positions, or any other
sort of instrument output. However, indications are often represented by mapping them
onto numbers, e.g. the number of ‘ticks’ the clock generated in a given period, the
displacement of the pointer relative to the ammeter dial, or the spatial density of diffraction
fringes. These numbers, which may be called ‘processed indications’, are convenient
representations of indications in mathematical form98. A processed indication is not yet an
estimate of any physical quantity of the sample being measured, but only a mathematical
description of a state of the measuring apparatus.
A measurement outcome, by contrast, is an estimate of a quantity value associated with
the object being measured, an estimate that is inferred from one or more indications.
Outcomes are expressed in terms of a particular unit on a particular scale and include, either
implicitly or explicitly, an estimate of uncertainty. Respective examples of measurement
outcomes are an estimate of duration in seconds, an estimate of electric current in Ampere,
and an estimate of distance between crystal layers in nanometers. Very often measurement
outcomes are recorded in the form of a mean value and a standard deviation that represents
the uncertainty around the mean, but other forms are commonly used, e.g. min-max value
range.
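The step from a series of processed indications to a measurement outcome of the mean-and-standard-deviation form can be sketched as follows. This is a minimal illustration, not part of the thesis: the readings are invented, and the choice of the standard deviation of the mean as the uncertainty measure is one convention among those the text mentions.

```python
import statistics

def outcome_from_indications(processed_indications, unit="s"):
    """Summarize repeated processed indications as a measurement outcome:
    a best estimate plus an uncertainty around it.

    Illustrative convention: the mean as the estimate, the standard
    deviation of the mean as the uncertainty."""
    n = len(processed_indications)
    mean = statistics.fmean(processed_indications)
    # Standard deviation of the mean (a statistical assumption, not a given)
    u = statistics.stdev(processed_indications) / n ** 0.5
    return mean, u, unit

# e.g. five repeated readings of a duration, in seconds
est, unc, unit = outcome_from_indications([1.02, 0.98, 1.01, 0.99, 1.00])
print(f"{est:.3f} \u00b1 {unc:.3f} {unit}")
```

Note that the bare list of readings is only a mathematical description of instrument states; it becomes an outcome only once summarized and attributed, with an uncertainty, to the measured object.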
98 The difference between numbers and numerals is important here. Before processing, an indication is never a number, though it may be a numeral (i.e. a symbol representing a number).
To attain the status of a measurement outcome, an estimate must be abstracted away
from its concrete method of production and pertain to some quantity objectively, namely, be
attributable to the measured object rather than the idiosyncrasies of the measuring instrument,
environment and human operators. Consider the ammeter: the outcome of measuring with
an ammeter is an estimate of the electric current running through the input wire. The
position of the ammeter pointer relative to the dial is a property of the ammeter rather than
the wire, and is therefore not a candidate for a measurement outcome. This is the case
whether or not the position of the pointer is represented on a numerical scale. It is only once
theoretical and statistical background assumptions are made and tested about the behaviour
of the ammeter and its relationship with the wire (and other elements in its environment)
that one can infer estimates of electric current from the position of the pointer. The ultimate
aim of calibration is to validate such inferences and characterize their uncertainty.
Processed indications are easily confused with measurement outcomes partly because
many instruments are intentionally designed to conceal their difference. Direct-reading
instruments, e.g. household mercury thermometers, are designed so that the numeral that
appears on their display already represents the best estimate of the quantity of interest on a
familiar scale. The complex inferences involved in arriving at a measurement outcome from
an indication are ‘black-boxed’ into such instruments, making it unnecessary for users to
infer the outcome themselves99. Regardless of whether or not users are aware of them, such
inferences form an essential part of measuring. They link claims such as ‘the pointer is
99 Somewhat confusingly, the process of ‘black-boxing’ is itself sometimes called ‘calibration’. For example, the setting of the null indication of a household scale to the zero mark is sometimes referred to as ‘calibration’. From a metrological viewpoint, this terminological confusion is to be avoided: “Adjustment of a measuring system should not be confused with calibration, which is a prerequisite for adjustment” (JCGM 2008, 3.11, Note 2). Calibration operations establish a relation between indications and outcomes, and this relation may later be expressed in a simpler manner by adjusting the display of the instrument.
between the 0.40 and 0.41 marks on the dial’ to claims like ‘the current in the wire is
0.405±0.005 Ampere’. If such inferences are to be deemed reliable, they must be grounded
in tested assumptions about the behaviour of the instrument and its interactions with the
sample and the environment.
4.2.3. Forward and backward calibration functions
The distinction between indications and outcomes allows us to clarify the two
products of calibration mentioned in the definition above. The intermediary product, recall,
is “a relation between the quantity values with measurement uncertainties provided by
measurement standards and corresponding indications with associated measurement
uncertainties”. This relation may be expressed in the form of a function, which I will call the
‘forward calibration function’:
<indication> = fFC ( <quantity value>, <additional parameter values>) (4.1)
The forward calibration function maps values of the quantity to be measured – e.g. the
current in the wire – to instrument indications, e.g. the position of the ammeter pointer100.
100 The term ‘calibration function’ (also ‘calibration curve’, see JCGM 2008, 4.31) is commonly used in metrological literature, whereas the designations ‘forward’ and ‘backward’ are my own. I call this a ‘forward’ function because its input values are normally understood as already having a determinate value prior to measurement and as determining its output value through a causal process. Nevertheless, my account of calibration does not presuppose this classical picture of measurement, and is compatible with the possibility that the quantity being measured does not have a determinate value prior to its measurement.
The forward calibration function may include input variables representing additional
quantities that may influence the indication of the instrument – for example, the intensity of
background magnetic fields in the vicinity of the ammeter. The goal of the first step of
calibration is to arrive at a forward calibration function and characterize the uncertainties
associated with its outputs, i.e. the instrument’s indications. This involves making theoretical
and statistical assumptions about the measurement process and empirically testing the
consequences of these assumptions, as we shall see below.
The second and final step of calibration is aimed at establishing “a relation for
obtaining a measurement result from an indication.” This relation may again be expressed in
the form of a function, which may be called the ‘backward calibration function’ or simply
‘calibration function’:
<quantity value> = fC ( <indication>, <additional parameter values>) (4.2)
A calibration function maps instrument indications to values of the quantity being
measured, i.e. to measurement outcomes. Like the forward function, the calibration function
may include additional input variables whose values affect the relation between indications
and outcomes. In the simplest (‘black-box’) calibration procedures additional input
parameters are neglected, and a calibration function is obtained by simply inverting the
forward function. Other (‘white-box’) calibration procedures represent the measurement
process in more detail, and the derivation of the calibration function becomes more
complex. Once a calibration function is established, metrologists use it to associate values of
the quantity being measured with indications of the instrument.
So far I have discussed the products of calibration without explaining how they are
produced. My methodological analysis of calibration will proceed in three stages, starting
with the simplest method of calibration and gradually increasing in complexity. The cases of
calibration I will consider are:
1. Black-box calibration against a standard whose uncertainty is negligible
2. White-box calibration against a standard whose uncertainty is negligible
3. White-box calibration against a standard whose uncertainty is non-negligible (‘two-way white-box’ calibration)
In each case I will show that the products of calibration are obtained by constructing
models of the measurement process, testing the consequences of these models and deriving
predictions from them. Viewing calibration as a modeling activity will in turn provide the key
to understanding how calibration establishes the accuracy of measurement outcomes.
4.3. Black-box calibration
In the most rudimentary case of calibration, the measuring instrument is treated as a
‘black-box’, i.e. as a simple input-output unit. The inner workings of the instrument and the
various ways it interacts with the sample, environment and human operators are either
neglected or drastically simplified. Establishing a calibration function is then a matter of
establishing a correlation between the instrument’s indications and corresponding quantity
values associated with a measurement standard.
For example, a simple caliper may be represented as a ‘black-box’ that converts the
diameter of an object placed between its legs to a numerical reading. The caliper is calibrated
by concatenating gauge blocks – metallic bars of known length – between the legs of the
caliper. We can start by assuming, for the time being, that the uncertainties associated with
the length of these standard blocks are negligible relative to those associated with the
outcomes of the caliper measurement. Calibration then amounts to a behavioural test of the
instrument under variations to the standard sample. The indications of the caliper are
recorded for different known lengths and a curve is fitted to the data points based on
background assumptions about how the caliper is expected to behave. The resulting forward
calibration function is of the form:
I0 = fFC (O) (4.3)
This function maps the lengths (O) associated with a combination of gauge blocks to
the indications of the caliper (I0). Notice that despite the simplicity of this operation, some
basic theoretical and statistical assumptions are involved. First, the shape chosen for fFC
depends on assumptions about the way the caliper converts lengths to indications. Second,
the use of gauge blocks implicitly assumes that length is additive under concatenation
operations. These assumptions are theoretical, i.e. they suppose that length enters into
certain nomic relations with other quantities or qualities. Third, associating uncertainties with
the indications of the caliper requires making one or more statistical assumptions, for
example, that the distribution of residual errors is normal. All of these assumptions are
idealizations: the response of any real caliper is not exactly linear, the concatenation of
imperfect rods is not exactly additive, and the distribution of errors is never exactly normal.
The first step of calibration is meant to test how well these idealizations, when taken
together, approximate the actual behaviour of the caliper. If they fit the data closely enough
for the needs of the application at hand, these idealized assumptions are then presumed to
continue to hold beyond the calibration stage, when the caliper is used to estimate the
diameter of non-standard objects. Under a purely behavioural, black-box representation of
the caliper, its calibration function is obtained by simple inversion of the forward function:
O = fC (I0) = fFC⁻¹ (I0) (4.4)
The calibration function expresses a hypothetical nomic relation between indications
and outcomes, a relation that is derived from a rudimentary theoretical-statistical model of
the instrument. This function may now be used to generate predictions concerning the
outcomes of caliper measurements. Whenever the caliper produces indication i the diameter
of the object between the caliper’s legs is predicted to be o = fC (i). Under this simple
representation of the measurement process, the uncertainty associated with measurement
outcomes arises wholly from uncontrolled variations to the indications of the instrument.
These variations are usually represented mathematically by applying statistical measures of
variation (such as standard deviation) to a series of observed indications. This projection is
based on an inductive argument and its precision therefore depends on the number of
indications observed during the first step of calibration.
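The two steps just described can be sketched in code. The sketch below follows the idealizations named in the text – a linear forward function and normally distributed residuals – but the gauge-block lengths, the caliper indications, and the particular uncertainty-propagation rule are my own illustrative assumptions, not data from the thesis.

```python
import numpy as np

# Step 1: record caliper indications for gauge blocks of known length
# (invented data; lengths O in mm, indications I0 in scale units)
standard_lengths = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # O
indications = np.array([10.2, 20.1, 30.4, 40.2, 50.5])        # I0

# Fit a linear forward calibration function I0 = fFC(O) = a*O + b,
# an idealizing assumption about how the caliper converts lengths
a, b = np.polyfit(standard_lengths, indications, deg=1)

# Uncertainty of indications: spread of residuals around the fit,
# assuming (idealizing again) normally distributed errors
residuals = indications - (a * standard_lengths + b)
u_indication = residuals.std(ddof=2)   # two fitted parameters

# Step 2: under a black-box representation, the calibration function
# is the simple inverse of the forward function: O = fC(I0) = (I0 - b)/a
def calibration_function(i0):
    return (i0 - b) / a

# Predict the diameter of a non-standard object from an indication
o = calibration_function(25.3)
u_outcome = u_indication / abs(a)      # propagated through the inverse
print(f"diameter \u2248 {o:.2f} \u00b1 {u_outcome:.2f} mm")
```

The inductive character of the projection is visible here: the uncertainty estimate rests entirely on the five calibration points, and says nothing about conditions that differ from those under which they were recorded.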
Black-box calibration is useful when the behaviour of the device is already well-
understood and when the required accuracy is not too high. Because the calibration function
takes only one argument, namely the instrument indication (I0), the resulting quantity-value
estimate (O) is insensitive to other parameters that may influence the behaviour of the
instrument. Such parameters may have to do with interactions among parts of the instrument,
the sample being measured, and the environment. They may also have to do with the
operation and reading of the instrument by humans, and with the way indications are
recorded and processed.
The neglect of these additional factors limits the ability to tell whether, and under what
conditions, a black-box calibration function can be expected to yield reliable predictions. As
long as the operating conditions of the instrument are sufficiently similar to calibration
conditions, one can expect the uncertainties associated with its calibration function to be
good estimates of the uncertainty of measurement outcomes. However, black-box
calibration represents the instrument too crudely to specify which conditions count as
‘sufficiently similar’. As a result, measurement outcomes generated through black-box
calibration are exposed to systematic errors that arise when measurement conditions change.
4.4. White-box calibration
4.4.1. Model construction
White-box calibration procedures represent the measurement process as a collection of
modules. This differs from the black-box approach to calibration, which treats the
measurement process as a single input/output unit. Each module is characterized by one or
more state parameters, laws of temporal evolution, and laws of interaction with other
modules. The collection of modules and laws constitutes a more detailed (but still idealized)
model of the measurement process than a black-box model.
Typically, a white-box model of a measuring process involves assumptions concerning:
(i) components of the measuring instrument and their mutual interactions; (ii) the measured
sample, including its preparation and interaction with the instrument; (iii) elements in the
environment (‘background effects’) and their interactions with both sample and instrument;
(iv) variability among human operators; and (v) data recording and processing procedures.
Each of these five aspects may be represented by one or more modules, though not every
aspect is represented in every case of white-box calibration.
A white-box representation of a simple caliper measurement is found in Schwenke et
al. (2000, 396). Figure 4.1 illustrates the modules and parameters involved. The measuring
instrument is represented by the component modules ‘leg’ and ‘scale’; the sample and its
interaction with the instrument by the modules ‘workpiece’ and ‘contact’, and the data by the
module ‘readout’. The environment is represented only indirectly by its influence on the
temperatures of the workpiece and scale, and variability among human operators is
completely neglected. Of course, one can easily imagine more or less detailed breakdowns of
a caliper into modules than the one offered here. The term ‘white-box’ should be
understood as referring to a wide variety of modular representations of the measurement
process with differing degrees of complexity, rather than a unique mode of representation101.
101 Simple modular representations are sometimes referred to as ‘grey-box’ models. See Boumans (2006, 121-2).
Figure 4.1: Modules and parameters involved in a white-box calibration of a simple caliper (Source: Schwenke et al. 2000)
The multiplicity of modules in white-box representations means that additional
parameters are included in the forward and backward calibration functions, parameters that
mediate the relation between outcomes and indications. In the caliper example, these
parameters include the temperatures and thermal expansion coefficients of the workpiece
and scale, the roughness of contact between the workpiece and caliper legs, the Abbe-error
(‘wiggle room’) of the legs relative to each other, and the resolution of the readout. These
parameters are assumed to enter into various dependencies with each other as well as with
the quantity being measured and the indications of the instrument. Such dependencies are
specified in light of background theories and tested through secondary experiments on the
apparatus.
Engineers who design, construct and test precision measuring instruments typically
express these dependencies in mathematical form, i.e. as equations. Such equations represent
the laws of evolution and interaction among different modules in a manner that is amenable
to algebraic manipulation. The forward and backward calibration functions are then
obtained by solving this set of equations and arriving at a general dependency relation
among model parameters102. The general form of a white-box forward calibration function
is:
I0 = fFC (O , I1 , I2 , I3 , … In) (4.5)
where I0 is the model’s prediction concerning the processed indication of the
instrument, O the quantity being measured, and I1,… In additional parameters. As before, O is
obtained by reference to a measurement standard whose associated uncertainties may for the
time being be neglected. The additional parameter values in the forward function are
estimated by performing additional measurements on the instrument, sample and
environment, e.g. by measuring the temperatures of the caliper and the workpiece, the
roughness of the contact etc.
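These relations can be illustrated schematically. The following sketch implements a toy forward function in the spirit of equation (4.5) for the caliper case; the linear thermal-expansion model and all parameter names and numerical values are hypothetical simplifications, not Schwenke et al.'s actual equations.

```python
# A toy forward calibration function in the spirit of eq. (4.5).
# The expansion model and all values are hypothetical simplifications.

T_REF = 20.0  # reference temperature (degrees Celsius)

def forward_calibration(length_mm, alpha_scale, alpha_piece,
                        temp_scale, temp_piece, resolution_mm):
    """Predict the processed indication I0 from the measured quantity O
    (length_mm) and additional parameters I1...In."""
    # The workpiece expands linearly relative to its length at T_REF.
    expanded = length_mm * (1 + alpha_piece * (temp_piece - T_REF))
    # The scale's own expansion makes a warm instrument read short.
    indicated = expanded / (1 + alpha_scale * (temp_scale - T_REF))
    # Quantize to the readout resolution.
    return round(indicated / resolution_mm) * resolution_mm

# At reference conditions the indication reproduces the standard value:
print(forward_calibration(100.0, 11.5e-6, 23e-6, 20.0, 20.0, 0.01))  # 100.0
```

The point of the sketch is only that the predicted indication depends on the additional parameters, each of which must itself be estimated by a secondary measurement.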
4.4.2. Uncertainty estimation
In the first step of white-box calibration, the forward function is derived from model
equations and tested against the actual behaviour of the instrument. Much like the black-box
case, testing involves recording the indications produced by the instrument in response to a
set of standard samples, and comparing these indications with the indications I0 predicted by
the forward function. But the analysis of residual errors is more complex in the white-box
case, because the instrument is represented in a more detailed way. On the one hand,
102 For the set of equations representing the caliper measurement process and a derivation of its forward calibration function see Schwenke et al. (2000, 396), eq. (3) and (4).
deviations between actual and predicted indications may be treated as uncontrolled (so-called
‘random’) variations in the measurement process. As in the black-box case, such
deviations are accounted for by modeling the residual errors statistically and arriving at a
measure of their probability distribution. Uncertainties evaluated in this way are labelled
‘type-A’ in the metrological literature103. On the other hand, observed indications may also
deviate from predicted indications because these predictions are based on erroneous
estimates of additional parameters I1,… In. A common source of uncertainty when predicting
indications is the fact that additional parameters I1,… In are estimated by performing
secondary measurements, and these measurements suffer from uncertainties of their own.
The effects of these ‘type-B’ uncertainties are evaluated by propagating them through the
model’s equations to the predicted indication I0. This alternative way of evaluating
uncertainty is not available under a black-box representation of the instrument because such
representation neglects the influence of additional parameters. In white-box calibration, by
contrast, both type-A and type-B methods are available and can be used in combination to
explain the total deviation between observed and predicted indications.
An example of the propagation of type-B uncertainties has already been discussed in
Chapter 1, namely the method of uncertainty budgeting. Individual uncertainty contributions are
evaluated separately and then summed up in quadrature (that is, as a root sum of squares). A
crucial assumption of this method is that the uncertainty contributions are independent of
each other.
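The quadrature rule itself is simple to state. A minimal sketch, assuming the contributions are independent and already expressed in the same (relative) units:

```python
import math

def combined_uncertainty(contributions):
    """Combine independent uncertainty contributions in quadrature,
    i.e. as a root sum of squares, as in an uncertainty budget."""
    return math.sqrt(sum(u ** 2 for u in contributions))

# Made-up budget entries (e.g. in parts in 10^6):
print(combined_uncertainty([3.0, 4.0, 12.0]))  # 13.0
```

Note that the combined figure is dominated by the largest contribution, which is why budgets are often used to identify which module of the model most needs improvement.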
Table 4.1 is an example of an uncertainty budget drawn for a measurement of the
Newtonian gravitational constant G with a torsion pendulum (Luo et al. 2009). In a
103 For details see JCGM (2008a).
contemporary variation on the 1798 Cavendish experiment, the pendulum is suspended in a
vacuum between two masses, and G is measured by determining the difference in torque
exerted on the pendulum at different mass-pendulum alignments. The white-box
representation of the apparatus is composed of several modules (pendulum, masses, fibre
etc.) and sub-modules, each associated with one or more quantities whose estimation
contributes uncertainty to the measured value of G. The last item in the budget is the
‘statistical’, namely type-A, uncertainty arising from uncontrolled variations. The total
uncertainty associated with the measurement is then calculated as the quadratic sum of
individual uncertainty contributions.
Table 4.1: Uncertainty budget for a torsion pendulum measurement of G, the Newtonian gravitational constant. Values are expressed in units of 10^-6. A diagram of the apparatus appears on the right.
(Source: Luo et al. 2009, 3)
The method of uncertainty budgeting is computationally simple. As long as the
uncertainties of different input quantities are assumed to be independent of each other, their
propagation to the measurement outcome can be calculated analytically. A more
computationally challenging case occurs when model parameters depend on each other in
nonlinear ways, thereby making it difficult or impossible to propagate the uncertainties
analytically. In such cases uncertainty estimates can sometimes be derived through computer
simulation. This is the case when metrologists attempt to calibrate coordinate measuring
machines (CMMs), i.e. instruments that measure the shape and texture of three-dimensional
objects by recording a series of coordinates along their surface. These instruments are
calibrated by constructing idealized models that represent aspects of the instrument
(amplifier linearity, probe tip radius), the sample (roughness, thermal expansion), the
environment (frame vibration) and the data acquisition mechanism (sampling algorithm).
Each such input parameter has a probability density function associated with it. The model
along with the probability density functions then serve to construct a Monte Carlo
simulation that samples the input distributions and propagates the uncertainty to
measurement outcomes (Schwenke et al. 2000, Trenk et al. 2004). This sort of computer-simulated calibration has become so prevalent that in 2008 the International Bureau of
Weights and Measures (BIPM) published a ninety-page supplement to its “Guide to the
Expression of Uncertainty in Measurement” dealing solely with Monte Carlo methods
(JCGM 2008b)104.
104 The topic of uncertainty propagation is, of course, much more complex than the discussion here is able to cover. Apart from the methods of uncertainty budgeting and Monte Carlo, several other methods of uncertainty propagation are commonly applied to physical measurement, including the Taylor method (Taylor 1997), probability bounds analysis, and Bayesian analysis (Draper 1995).
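The basic logic of Monte Carlo propagation can be sketched briefly. Assuming, for simplicity alone, that every input parameter has a Gaussian probability density function (real applications use whatever distributions the model specifies), a toy implementation might look as follows:

```python
import random
import statistics

def monte_carlo_uncertainty(model, input_dists, n=100_000, seed=0):
    """Propagate input uncertainties through a (possibly nonlinear)
    model by sampling each input's distribution. Here every input is
    taken to be Gaussian, given as a (mean, standard deviation) pair."""
    rng = random.Random(seed)
    outcomes = [model(*(rng.gauss(mu, sigma) for mu, sigma in input_dists))
                for _ in range(n)]
    return statistics.mean(outcomes), statistics.stdev(outcomes)

# Hypothetical nonlinear model: an area computed from two uncertain lengths.
mean, std = monte_carlo_uncertainty(lambda a, b: a * b,
                                    [(10.0, 0.1), (5.0, 0.05)])
print(round(mean, 2), round(std, 2))  # close to 50.0 and 0.71
```

Because the model is applied to sampled values rather than differentiated analytically, the method handles nonlinear dependencies among parameters that defeat the quadrature rule.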
During uncertainty evaluation, the forward calibration function is iteratively tested for
compatibility with observed indications. Deviations from the predicted indications that fail
to be accounted for by either type-A or type-B methods are usually a sign that the white-box
model is misrepresenting the measurement process. Sources of potential misrepresentation
include, for example, a neglected or insufficiently controlled background effect, an
inadequate statistical model of the variability in indications, an error in measuring an
additional parameter, or an overly simplified representation of the interaction between
certain modules. Much like other cases of scientific modeling and experimenting, white-box
calibration involves iterative modifications to the model of the apparatus as well as to the
apparatus itself in an attempt to account for remaining deviations. The stage at which this
iterative process is deemed complete depends on the degree of measurement accuracy
required and on the ability to physically control the values of additional parameters.
4.4.3. Projection
Once sufficiently improved to account for deviations, the white-box model is
projected beyond the circumstances of calibration onto the circumstances that are presumed
to obtain during measurement. This is the second step of calibration, which involves the
derivation of a backward function from model equations. The general form of a white-box
backward calibration function is:
O = fC (I0 , I1 , I2 , I3 , … In) (4.6)
In general, a white-box calibration function cannot be obtained by inverting the
forward function, but requires a separate derivation. Nevertheless, the additional parameters
I1…In are often presumed to be constant and equal (within uncertainty) to the values they
had during the first step of calibration. For example, metrologists assume that the caliper will
be used to measure objects whose temperature and roughness are the same (within
uncertainty) as those of the workpieces that were used to calibrate it. This assumption of
constancy has a double role. First, it allows metrologists to easily obtain a calibration
function by inverting the forward function. Second, the assumption of constancy is
epistemically important, as it specifies the scope of projectability of the calibration function. The
function is expected to predict measurement outcomes correctly only when additional
parameters I1…In fall within the value ranges specified. If circumstances differ from this
narrow specification, it is necessary to derive a new calibration function for these new
circumstances prior to obtaining measurement outcomes.
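The role of the constancy assumption can be made vivid with a toy sketch: a backward function that refuses to produce an outcome when an additional parameter falls outside the value range it occupied during calibration. The parameter names, ranges and the linear inversion are all hypothetical.

```python
# Hypothetical calibration scope: the ranges the additional parameters
# occupied during the first step of calibration.
CALIBRATED_RANGES = {"temp_piece": (19.0, 21.0), "roughness_um": (0.0, 1.5)}

def backward_calibration(indication_mm, temp_piece, roughness_um):
    """Infer the outcome O from the indication I0 (cf. eq. 4.6), but
    only within the scope of projectability of the calibration."""
    for name, value in [("temp_piece", temp_piece),
                        ("roughness_um", roughness_um)]:
        low, high = CALIBRATED_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} lies outside the calibration "
                             "scope; a new calibration function is needed")
    # Within scope, invert a (hypothetical) linear thermal correction.
    return indication_mm * (1 + 23e-6 * (temp_piece - 20.0))

print(backward_calibration(100.0, 20.5, 0.8))
```

The explicit range check is the epistemic content of the constancy assumption: outside those ranges the function simply has no warrant.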
This last point sheds light on an important difference between black-box and white-
box calibration: they involve different trade-offs between predictive generality and predictive
accuracy. Black-box calibration models predict the outcomes of measuring procedures under
a wide variety of circumstances, but with relatively low accuracy, as they fail to take into
account local factors that intervene on the relation between indications and outcomes.
White-box calibration operations, on the other hand, specify such local factors narrowly, but
their predictions are projectable only within that narrow scope. Of course, a continuum lies
between these two extremes. A white-box calibration function can be made more general by
widening the specified value range of its additional parameters or by considering fewer such
additional parameters. In doing so, however, one generally increases the uncertainty
associated with the function’s predictions.
4.4.4. Predictability, not just correlation
Another epistemically important difference between black-box and white-box
calibration is the role played by measurement standards in each case. In black-box
calibration, one attempts to obtain a stable correlation between the processed indications of
the measuring instrument and standard values of the quantity to be measured. By ‘stable’ I
mean repeatable over many runs; by ‘correlation’ I mean a mapping between two variables
that is unique (i.e. bijective) up to some stated uncertainty. For example, the black-box
calibration of a caliper is considered successful if a stable correlation is obtained between its
readout and the number of 1-millimetre standard blocks concatenated between caliper legs.
This may lead one to hastily conclude that obtaining such correlations is necessary and
sufficient for successful calibration. But this last claim does not generalize to white-box
calibration, where one attempts to obtain a stable correlation between the processed
indications of the measuring instrument and the predictions of an idealized model of the
measurement process105. A correlation of the first sort does not imply a correlation of the
second sort.
105 A different way of phrasing this claim would be to say that the idealized model itself functions as a measurement standard, though this way of talking deviates from the way metrologists usually use the term ‘measurement standard’.
To see the point, recall that during the first step of white-box calibration one accounts
for deviations between observed indications and the indications predicted by the forward
calibration function. The forward function is derived from equations specified by an
idealized model of the measurement process. The total uncertainty associated with these
deviations is accordingly a measure of the predictability of the behaviour of the measurement
process by the model. Recall further that the indications I0 predicted by a white-box model
depend not only on standard quantity values O but also on a host of additional parameters
I1…In , as well as on laws of evolution and interaction among modules. Consequently, a mere
correlation between observed indications and standard quantity values is insufficient for
successful white-box calibration. To be deemed predictable under a given white-box model,
indications should also exhibit the expected dependencies on the values of additional
parameters.
As an example, consider the caliper once more. If the standard gauge blocks are
gradually heated as they are concatenated, theory predicts that the indications of the caliper
will deviate from linear dependence on the total length of the blocks due to the uneven
expansion rates of the blocks and the caliper. Now suppose that this nonlinearity fails to be
detected empirically – that is, the caliper’s indications do not display the sensitivity to
temperature predicted by its white-box model but instead remain linearly correlated to the
total length of the gauge blocks. It is tempting to conclude from this that the caliper is more
accurate than previously thought. This would be a mistake, however, for accuracy is a
property of an inference from indications to outcomes, and this inference has proved
inaccurate in our case. Instead, the right conclusion from such an empirical finding in the
context of white-box calibration is that the model of the caliper is in need of correction. It
may be that the dependency of indications on temperature has a different coefficient than
presumed, or that a hidden background effect cancels out the effects of thermal expansion,
etc. Unless an adequate correction is made to the model of the caliper, the uncertainty
associated with its predictions – and hence with the outcomes of caliper measurements –
remains high despite the linear correlation between indications and standard quantity values.
It is the overall predictive uncertainty of the model, rather than the correlation of
indications with standard values, that determines the uncertainty of measurement outcomes.
We already saw that in the second step of calibration model assumptions are projected
beyond the calibration phase and used to predict measurement outcomes. The total
uncertainty associated with measurement outcomes then expresses the likelihood that the
measured quantity value will fall in a given range when the indications of the instrument are
such-and-such. In other words, measurement uncertainty is a measure of the predictability of
measurement outcomes under an idealized model of the measurement process, rather than a
measure of closeness of correlation between the observed behaviours of the instrument and
values supplied by standards.
This conclusion may be generalized to black-box calibration. Black-box calibration is,
after all, a special case of white-box calibration where additional parameters are neglected.
All sources of uncertainty are represented as uncontrolled deviations from the expected
correlation between indications and standard values, and evaluated through type-A
(‘statistical’) methods. A black-box model, in other words, is a coarse-grained representation
of the measuring process under which measurement uncertainty and closeness of correlation
with a standard happen to coincide. Nevertheless, in both black- and white-box cases theoretical
and statistical considerations enter into one’s choice of model assumptions, and in both
cases total measurement uncertainty is a measure of the predictability of the outcome under
those assumptions. Black-box calibration is simply one way to ground such predictions, by
making data-driven empirical generalizations about the behaviour of an instrument. Such
generalizations suffer from higher uncertainties and a fuzzier scope than the predictions of
white-box models, but have the same underlying inferential structure.
The emphasis on predictability distinguishes the model-based account from narrower
conceptions of calibration that view it as a kind of reproducibility test. Allan Franklin, for
example, defines calibration as “the use of a surrogate signal to standardize an instrument.”
(1997, 31). Though he admits that calibration sometimes involves complex inferences, in his
view the ultimate goal of such inferences is to ascertain the ability of the apparatus to
reproduce known results associated with standard samples (‘surrogate signals’). A similar
view of calibration is expressed by Woodward (1989, 416-8). These restrictive views treat
calibration as an experimental investigation of the measuring apparatus itself, rather than an
investigation of the empirical consequences of modelling the apparatus under certain
assumptions. Hence Franklin seems to claim that, at least in simple cases, the success or
failure of calibration procedures is evident through observation. The calibration of a
spectrometer, for example, is understood by Franklin as a test for the reproducibility of
known spectral lines as seen on the equipment’s readout (Franklin 1997, 34). Such views fail
to recognize that even in the simplest cases of calibration one still needs to make idealized
assumptions about the measurement process. Indeed, unless the instrument is already
represented under such assumptions reproducibility tests are useless, as there are no grounds
for telling whether a similarity of indications should be taken as evidence for a similarity in
outcomes, and whether the behaviour of the apparatus can be safely projected beyond the
test stage. Despite this, restrictive views neglect the representational aspect of calibration and
only admit the existence of an inferential dimension to calibration in special and highly
complex cases (ibid, 75).
4.5. The role of standards in calibration
4.5.1. Why standards?
As I have argued so far, the ultimate goal of calibration is to predict the outcomes of a
measuring procedure under a specified set of circumstances. This goal is only partially served
by establishing correlations with standard values, and must be complemented with a detailed
representation of the measurement process whenever high accuracy is required. This line of
argument raises two questions concerning the role of measurement standards. First, how
does the use of standards contribute to the accurate prediction of measurement outcomes?
Second, is establishing a correlation between instrument indications and standard values
necessary for successful calibration?
The simple answer to the first question is that standards supply reference values of the
quantity to be measured. That is, they supply values of the variable O that are plugged into
the forward calibration function, thereby allowing predictions concerning the instrument’s
indications to be tested empirically. But this answer is not very informative by itself, for it
does not explain why one ought to treat the values supplied by standards as accurate. In the
previous sections we simply assumed that standards provide accurate values, despite the fact
that (as already shown in Chapter 1) even the most accurate measurement standards have
nonzero uncertainties. Given that the procedures metrologists call ‘standards’ are not
absolutely accurate, is there any reason to use them for estimating values of O rather than
any other procedure that measures the same quantity?
As I am about to show, the answer depends on whether the question is understood in
a local or global context. Locally – for any given instance of calibration – it makes no
epistemic difference whether one calibrates against a metrological standard or against some
other measuring procedure, provided that the uncertainty associated with its outcomes is
sufficiently low. By contrast, from a global perspective – when the web of inter-procedural
comparisons is considered as a whole – the inclusion of metrological standards is crucial, as
it ensures that the procedures being compared are measuring the quantity they are intended
to.
4.5.2. Two-way white-box calibration
Let us begin with the local context, and consider any particular pair of procedures – call
them the ‘calibrated’ and ‘reference’ procedures. During calibration, the reference procedure
is used to measure values of O associated with certain samples; these values are plugged into
the forward function of the calibrated instrument and used to predict its indications; and
these predictions are then compared to the actual indications produced by the calibrated
instrument in response to the same (or similar) samples. For the sake of carrying out this
procedure, it makes no difference whether the reference procedure is a metrologically
sanctioned standard or not, because the accuracy of metrologically sanctioned standards is
evaluated in exactly the same way as the accuracy of any other measurement procedure. The
uncertainties associated with standard measuring procedures are evaluated by constructing
white-box models of those standards, deriving a forward calibration function, propagating
uncertainties through the model, and testing model predictions for compatibility with other
standards.
Table 4.2: Type-B uncertainty budget for NIST-F1, the US primary frequency standard. The clock is deemed highly accurate despite the large discrepancy between its indications and the cesium clock
frequency, because the correction factor is accurately predictable. (Source: Jefferts et al. 2007, 766)
This has already been shown in Chapter 1. For example, the cesium fountain clock
NIST-F1, which serves as the primary frequency standard in the US, has a fractional
frequency uncertainty of less than 5 parts in 10^16 (Jefferts et al. 2007). This uncertainty is
evaluated by modelling the clock theoretically and statistically, drawing an uncertainty budget
for the clock, and testing these uncertainty estimates for compatibility with other cesium
fountain clocks106. Table 4.2 is a recent uncertainty budget for NIST-F1 (including only
type-B evaluations). Note that the systematic corrections applied to the clock’s indications
(the total frequency ‘bias’) far exceed the total type-B uncertainty associated with the
106 By ‘modeling the clock statistically’ I mean making statistical assumptions about the variation of its indications over time. These assumptions are used to construct models of noise, such as white noise, flicker noise and Brownian noise (see also section 3.4.2.)
outcome. In other words, the clock ‘ticks’ considerably faster – by hundreds of standard
deviations – than the cesium frequency it is supposed to measure. The clock is nevertheless
deemed highly accurate, because the cesium frequency is predictable from clock indications
(‘ticks’) with a very low uncertainty.
Now consider a scenario in which a cesium fountain clock is used to calibrate a
hydrogen spectrometer, i.e. a device measuring the frequency associated with subatomic
transitions in hydrogen. Such calibration is described in Niering et al (2000). The accuracy
expected of the spectrometer is close to that of the standard, so that one cannot neglect the
inaccuracies associated with the standard during calibration. In this case metrologists must
consider two white-box models – one for the calibrated instrument and one for the standard
– and compare the measurement outcomes predicted by each model. These predicted
measurement outcomes already incorporate bias corrections and are associated with
estimates of total uncertainty propagated through each model. The calibration is then
considered successful if and only if the outcomes of the two instruments, as predicted by their
respective models, coincide within their respective uncertainties:
fC (I0 , I1 , I2 , I3 ,… In) ≈ fC’ (I0’ , I1’ , I2’ , I3’ ,… Im’) (4.7)
where fC is the calibration function associated with the hydrogen spectrometer, fC’ is the
calibration function associated with the cesium standard, and ≈ stands for ‘compatible up to
stated uncertainty’. Notice the complete symmetry between the calibrated instrument and
the standard as far as the formal requirement for successful calibration goes. This symmetry
expresses the fact that a calibrated procedure and a reference procedure differ only in their
degree of uncertainty and not in the way uncertainty is evaluated.
This ‘two-way white-box’ procedure exemplifies calibration in its full generality, where
both the calibrated instrument and the reference are represented in a detailed manner. As
equation (4.7) makes clear, calibration is successful when it establishes a predictable
correlation between the outcomes of measuring procedures under their respective models,
rather than between their observed indications. Such correlation amounts to an empirical
confirmation that the predictions of different calibration functions are mutually compatible.
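The compatibility criterion of equation (4.7) can be sketched as a simple numerical test. The coverage factor k = 2 is a common metrological convention, and the example values are purely illustrative:

```python
import math

def compatible(outcome_a, u_a, outcome_b, u_b, k=2.0):
    """Test whether two model-predicted outcomes coincide within their
    combined stated uncertainties, using coverage factor k."""
    return abs(outcome_a - outcome_b) <= k * math.hypot(u_a, u_b)

# Purely illustrative outcomes and uncertainties:
print(compatible(10.003, 0.002, 10.001, 0.002))  # True
```

The test is symmetrical in its two arguments, mirroring the symmetry between calibrated instrument and standard noted above.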
4.5.3. Calibration without metrological standards
The importance of reference procedures in calibration, then, lies in the fact that they are
modelled more accurately – namely, with lower predictive uncertainties – than the
procedures they are used to calibrate. It is now easy to see that a reference procedure does
not have to be a metrological standard, but instead may be any measurement procedure
whose uncertainties are sufficiently low to make the comparison informative. We already
saw an example of successful calibration without any reference to a metrological standard in
Chapter 1, where two optical clocks were calibrated against each other. In that case both
clocks had significantly lower measurement uncertainties than the most accurate
metrologically sanctioned frequency standard. More mundane examples of calibration
without metrological standards are found in areas where an institutional consensus has not
yet formed around the proper application of the measured concept. Paper quality is an
example of a complex vector of quantities (including fibre length, width, shape and
bendability) for which an international reference standard does not yet exist (Wirandi and
Lauber 2006). The instruments that measure these quantities are calibrated against each
other in a ‘round-robin’ ring, without a central reference standard (see Figure 4.2). This
procedure provides sufficient assurance that the outcomes of measurements taken in one
laboratory reliably predict the outcomes obtained for a similar sample at any other
participating laboratory.
Figure 4.2: A simplified diagram of a round-robin calibration scheme. Authorized laboratories calibrate their measuring instruments against each other’s, without reference to a central standard.
(source: Wirandi and Lauber 2006, 616)
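A round-robin ring can be sketched as a pairwise compatibility test in which no laboratory plays the role of a central standard. The laboratory names, values and uncertainties below are invented for illustration:

```python
import math
from itertools import combinations

def round_robin_check(outcomes, k=2.0):
    """Compare every pair of laboratories in a round-robin ring and
    return the pairs whose outcomes are mutually incompatible.
    `outcomes` maps a lab name to a (value, uncertainty) pair."""
    failures = []
    for (lab_a, (v_a, u_a)), (lab_b, (v_b, u_b)) in combinations(
            outcomes.items(), 2):
        if abs(v_a - v_b) > k * math.hypot(u_a, u_b):
            failures.append((lab_a, lab_b))
    return failures

# Invented fibre-length results (value, uncertainty) from three labs:
labs = {"A": (2.31, 0.02), "B": (2.33, 0.02), "C": (2.45, 0.02)}
print(round_robin_check(labs))  # C disagrees with both A and B
```

An outlying laboratory is identified not by its distance from a designated standard but by its incompatibility with the rest of the ring, which is precisely the robustness structure discussed in the next section.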
These examples of ‘symmetrical’ calibration all share a common inferential structure,
namely, they establish the accuracy of measuring procedures through a robustness argument.
The predictions derived from models of multiple measuring procedures are tested for
compatibility with each other, and those that pass the test are taken to be accurate up to
their respective uncertainties. As I will argue in the next section, this inferential structure is
essential to calibration and present even in seemingly asymmetrical cases where the
inaccuracy of standards is neglected.
Before discussing robustness, let me return to the role of metrological standards in
calibration. From a local perspective, as we have already seen, metrological standards are not
qualitatively different from other measuring procedures in their ability to produce reference
values for calibration. Metrological standards are useful only insofar as the uncertainty
associated with their values is low enough for calibration tests to be informative. Of
course, the values provided by metrological standards are usually associated with very low
uncertainties. But it is not by virtue of their role as standards that their uncertainties are
deemed low. The opposite is the case: the ability to model certain procedures in a way that
facilitates accurate predictions of their outcomes motivates metrologists to adopt such
procedures as standards, as already discussed in Chapter 1.
Above I posed the question: is establishing correlation with standard values necessary
for successful calibration? A partial answer can now be given. Insofar as local instances of
calibration are concerned, it is not necessary to appeal to metrologically sanctioned standards
in order to evaluate uncertainty. Establishing correlations among outcomes of different
measuring procedures is sufficient for this goal. One may, of course, call some of the
procedures being compared ‘standards’ insofar as they are modelled more accurately than
others. But this designation does not mark any qualitative epistemic difference between
standard and non-standard procedures.
4.5.4. A global perspective
The above is not meant to deny that choices of metrological standards carry with them
a special normative force. There is still an important difference between calibrating an
instrument against a non-standard reference procedure and calibrating it against a
metrological standard, even when both reference procedures are equally accurate. The
difference is that in the first case, if significant discrepancies are detected between the
outcomes of the two procedures, either one is in principle equally amenable to
correction. All other things being equal, the models from which a calibration function is
derived for either procedure are equally revisable. This is not the case if the reference
procedure is a metrological standard, because a model representing a standard procedure has
a legislative function with respect to the application of the quantity concept in question.
This legislative (or ‘coordinative’) function of metrological models has been discussed
at length in Chapter 3. The theoretical and statistical assumptions with which a metrological
standard is represented serve a dual, descriptive and normative role. On the one hand, they
predict the actual behaviour of the process that serves as a standard, and on the other, they
prescribe how the concept being standardized is to be applied to that process. Metrological
standards can fulfill this legislative function because they are modelled in terms of the
theoretical definition of the relevant concept, that is, they constitute realizations of that
concept. For this reason, in the face of systematic discrepancies between the outcomes of
standard and non-standard procedures, there is a good reason to prefer a correction to the
outcomes of non-standard procedures over a correction to the outcomes of standard
procedures.
Note that this preference does not imply that metrological standards are categorically
more accurate, or accurate for different reasons, than other measuring procedures. The total
uncertainty associated with a measuring procedure is still evaluated in exactly the same way
whether or not that procedure is a metrological standard. But the second-order uncertainty
associated with metrological standards – that is, the uncertainty associated with evaluations
of their uncertainty – is especially low. This is the case because metrological standards are
modelled in terms of the theoretical definition of the quantity they realize, and their
uncertainties are accordingly estimates of the degree to which the realization succeeds in
approximately satisfying the definition. Such uncertainty estimates enjoy a higher degree of
confidence than those associated with non-standard measuring procedures, because the
latter are not directly derived from the theoretical definition of the measured quantity and
cannot be considered equally safe estimates of the degree of its approximate satisfaction. For
this reason, the assumptions under which non-standard measuring procedures are modelled
are usually deemed more amenable to revision than the assumptions informing the modelling
of metrological standards. This is the case even when the non-standard procedure is thought to be
more accurate, i.e. to have lower first-order uncertainty, than the standard. For example, if
an optical atomic clock were to systematically disagree with a cesium standard, the model of
the former would be more amenable to revision despite it supposedly being the more
accurate clock.
The importance of the normative function of metrological standards is revealed from a
global perspective on calibration, when one views the web of inter-procedural comparisons
as a whole. Here metrological standards form the backbone that holds the web together by
providing a stable reference for correcting systematic errors. The consistent distribution of
systematic errors across the web makes possible its subsumption under a single quantity
concept, as explained in Chapter 2. In the absence of a unified policy for distributing errors,
nothing prevents a large web from breaking into ‘islands’ of intra-comparable but mutually
incompatible procedures. By legislating how an abstract quantity concept is to be realized,
models of metrological standards serve as a kind of ‘semantic glue’ that ties together distant
parts of the web.
As an example, consider all the clocks calibrated against Coordinated Universal Time
either directly or indirectly, e.g. through national time signals. What justifies the claim that all
these clocks are measuring, with varying degrees of accuracy, the same quantity – namely,
time on a particular atomic scale? The answer is that all these clocks produce consistent
outcomes when modelled in terms of the relevant quantity, i.e. UTC. But to test whether
they do, one must first determine what counts as an adequate way of applying the concept of
Coordinated Universal Time to any particular clock. This is where metrological standards
come into play: they fix a semantic link between the definition of the quantity being
standardized and each of its multiple realizations. In the case of UTC, this legislation is
performed by modelling a handful of primary frequency standards and several hundred
secondary standards in a manner that minimizes their mutual discrepancies, as described in
Chapter 3. It then becomes possible to represent non-standard empirical procedures such as
quartz clocks in terms of the standardized quantity by correcting their systematic errors
relative to the network’s backbone. In the absence of this ongoing practice of correction, the
web of clocks would quickly devolve into clusters that measure mutually incompatible
timescales.
From a global perspective, then, metrological standards still play an indispensable
epistemic role in calibration whenever (i) the web of instruments is sufficiently large and (ii)
the quantity being measured is defined theoretically. This explains why metrological rigour is
necessary for standardizing quantities that have reached a certain degree of theoretical
maturity. At the same time, the analysis above explains why metrological standards are
unnecessary for successful calibration in the case of ‘nascent’ quantities such as paper quality.
4.6. From predictive uncertainty to measurement accuracy
We saw above (Section 4.4) that measurement uncertainty is a kind of predictive
uncertainty. That is, measurement uncertainty is the uncertainty associated with predictions
of the form: “when the measuring instrument produces indication i the value of the
measured quantity will be o.” Such predictions are derived during calibration from statistical
and theoretical assumptions about the measurement process. Calibration tests proceed by
comparing the outcomes predicted by a model of one measuring procedure (the ‘calibrated’
procedure) to the outcomes predicted by a model of another measuring procedure (the
‘reference’ procedure). When the predicted outcomes agree within their stated uncertainties,
calibration is deemed successful. This success criterion is expressed by equation (4.7).
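The agreement criterion just described can be illustrated with a short sketch. Equation (4.7) itself is stated earlier in the chapter; the code below instead uses the normalized-error test common in metrological comparisons, and should be read as an illustration of the general idea rather than as the thesis's own formula. The function name, the numerical values, and the coverage factor k = 2 are all hypothetical.

```python
import math

def compatible(o_cal, u_cal, o_ref, u_ref, k=2.0):
    """Return True if two predicted outcomes agree within their
    combined expanded uncertainty (coverage factor k).

    This is the normalized-error criterion often used in
    interlaboratory comparisons; equation (4.7) in the thesis
    may differ in detail.
    """
    combined_u = math.sqrt(u_cal**2 + u_ref**2)
    return abs(o_cal - o_ref) <= k * combined_u

# Two procedures predicting the same temperature outcome (values in °C):
print(compatible(100.02, 0.03, 99.97, 0.02, k=2.0))  # discrepancy 0.05 vs. 2*sqrt(0.03^2 + 0.02^2)
```

On this rendering, a successful calibration is simply one for which `compatible` holds for the outcomes predicted by the calibrated and reference procedures across the relevant range of indications.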
At first glance it seems that calibration should only be able to provide estimates of
consistency among predictions of measurement outcomes. And yet metrologists routinely use
calibration tests to estimate the accuracy of outcomes themselves. That is, they infer from
the mutual consistency among predicted outcomes that the outcomes are accurate up to
their stated uncertainties. The question arises: why should estimates of consistency among
outcomes predicted for different measuring procedures be taken as good estimates of the
accuracy of those outcomes?
The general outline of the answer should already be familiar from my discussion of
robustness in Chapter 1. There I showed how robustness tests of the form (RC), performed
among multiple realizations of the same measurement unit, ground claims to the accuracy of
those realizations. I further showed that this conclusion holds regardless of the particular
meaning of ‘accuracy’ employed – be it metaphysical, epistemic, operational, comparative or
pragmatic. The final move, then, is to expand the scope of (RC) to include measuring
procedures in general. The resulting ‘generalized robustness condition’ may be formulated in
the following way:
(GRC) Given multiple, sufficiently diverse processes that are used to measure
the same quantity, the uncertainties ascribed to their outcomes are
adequate if and only if
(i) discrepancies among measurement outcomes fall within their
ascribed uncertainties; and
(ii) the ascribed uncertainties are derived from appropriate models
of each measurement process.
Uncertainties that satisfy (GRC) are reliable measures of the accuracies of
measurement outcomes under all five senses of ‘measurement accuracy’, for the same
reasons that applied to (RC).¹⁰⁷
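Clause (i) of (GRC) lends itself to a simple computational illustration. The sketch below checks whether every pairwise discrepancy among a set of outcomes falls within the combined expanded uncertainties; the function name, the sample data, and the coverage factor are hypothetical, and clause (ii), the adequacy of the underlying models, cannot be checked numerically and is simply assumed here.

```python
import math
from itertools import combinations

def grc_condition_i(outcomes, k=2.0):
    """Check clause (i) of the generalized robustness condition:
    every pairwise discrepancy among measurement outcomes falls
    within the combined expanded ascribed uncertainties.

    `outcomes` is a list of (value, uncertainty) pairs, one per
    measuring procedure.
    """
    return all(
        abs(v1 - v2) <= k * math.sqrt(u1**2 + u2**2)
        for (v1, u1), (v2, u2) in combinations(outcomes, 2)
    )

# Three diverse procedures measuring the same quantity (hypothetical data):
procedures = [(9.81, 0.02), (9.79, 0.03), (9.80, 0.01)]
print(grc_condition_i(procedures))
```

Note that the check is holistic: adding a new procedure to the list tests its ascribed uncertainty against every other member at once, which anticipates the web-of-calibrations picture developed below.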
What remains to be clarified is how calibration operations test the satisfaction of
(GRC). Recall that a calibration is deemed successful – that is, its model is deemed good at
predicting the outcomes of the measuring procedure to within the stated uncertainty – when
the predicted outcomes are shown to be consistent with those associated with a reference procedure. Now
consider an entire web of such successful calibration operations. Each ‘link’ in the web
stands for an instance of pairwise calibration, and is associated with some uncertainty that is
a combination of uncertainties from both calibrated and reference procedures. Assuming
that there are no cumulative systematic biases across the web, the relation of compatibility
¹⁰⁷ See Chapter 1, Section 1.5: “A robustness condition for accuracy”.
within uncertainty ≈ can be assumed to be transitive.¹⁰⁸ Consequently, measurement
uncertainties that are vindicated by one pairwise calibration are traceable throughout the
web. The outcomes of any two measurement procedures in the web are predicted to agree
within their ascribed uncertainties even if they are never directly compared to each other.
The web of calibrations for a given quantity may therefore be considered an indirect
robustness test for the uncertainties associated with each individual measuring procedure.
Each additional calibration that successfully attaches its uncertainty estimates to the web
indirectly tests those estimates for compatibility with many other estimates made for a
variety of other measuring procedures. In other words, each additional calibration
constitutes an indirect test as to whether (GRC) is satisfied when the web of comparisons is
appended with a putative new member.
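Why compatibility within uncertainty is only approximately transitive can be illustrated by propagating uncertainty along a chain of pairwise calibrations. Even with no systematic biases, independent link uncertainties accumulate in quadrature, so distant nodes of the web agree only within a larger combined uncertainty; this is the numerical counterpart of the point that long chains require standards at strategic junctions. The link values below are hypothetical.

```python
import math

def chain_uncertainty(link_uncertainties):
    """Combined standard uncertainty accumulated along a chain of
    pairwise calibrations, assuming independent errors at each link
    (root sum of squares). With cumulative systematic biases the
    growth would be faster and transitivity would fail sooner.
    """
    return math.sqrt(sum(u**2 for u in link_uncertainties))

# A hypothetical clock traced to a primary standard through three links,
# link uncertainties in seconds:
links = [0.5e-9, 1.2e-9, 2.0e-9]
print(chain_uncertainty(links))  # larger than any single link uncertainty
```

The combined value bounds how far two never-directly-compared procedures at opposite ends of the chain may disagree while the web as a whole still counts as consistent.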
This conclusion holds equally well for black-box and one-way white-box calibration, which
are but special cases of the fully general, two-way white-box case. To be sure, in these special
cases some of the complexities involved in deriving and testing model-based predictions
remain implicit. In the one-way white-box case one makes the simplifying assumption that
the behaviour of the standard is perfectly predictable. In the black-box case one additionally
makes the simplifying assumption that changes in extrinsic circumstances will not influence
the relation between indications and outcomes. These varying levels of idealization affect the
accuracy and generality with which measurement outcomes are predicted, but not the
general methodological principle according to which compatibility among predictions is the
ultimate test for measurement accuracy.
¹⁰⁸ Note that this last assumption is adequate only when the web is small (i.e. when the maximal distance among nodes is small) or when metrological standards are included at strategic junctions, as already discussed above.
4.7. Conclusions
This chapter has argued that calibration is a special sort of modelling activity. Viewed
locally, calibration is the complex activity of constructing, testing, deriving predictions from,
and propagating uncertainties through models of a measurement process. Viewed globally,
calibration is a test of robustness for model-based predictions of multiple measuring
processes. This model-based account of calibration solves the problem of accuracy posed in
the introduction to this thesis. As I have shown, uncertainty estimates that pass the
robustness test are reliable estimates of measurement accuracy despite the fact that the
accuracy of any single measuring procedure cannot be evaluated in isolation.
The key to the solution was to show that, from an epistemological point of view,
measurement accuracy is but a special case of predictive accuracy. As far as it is knowable,
the accuracy of a measurement outcome is the accuracy with which that outcome can be
predicted on the basis of a theoretical and statistical model of the measurement process. A
similar conclusion holds for measurement outcomes themselves, which are the results of
predictive inferences from model assumptions mediated through the derivation of a
calibration function. The intimate inferential link between measurement and prediction has
so far been ignored in the philosophical literature, and has potentially important
consequences for the relationship between theory and measurement.
Epilogue
In the introduction to this thesis I outlined three epistemological problems concerning
measurement: the problems of coordination, accuracy and quantity individuation. In each of
the chapters that followed I argued that these problems are solved (or dissolved) by
recognizing the roles models play in measurement. A precondition for measuring is the
coherent subsumption of measurement processes under idealized models. Such
subsumption is a necessary condition for obtaining objective measurement outcomes from
local and idiosyncratic instrument indications. In addition, I have shown that contemporary
methods employed in the standardization of measuring instruments indeed achieve the goal
of coherent subsumption. Hence the model-based account meets both the general and the
practice-based epistemological challenges set forth in the introduction.
A general evidential condition for testing measurement claims has emerged from my
studies, which may be called ‘convergence under representations’. Claims to measurement,
accuracy and quantity individuation are settled by testing whether idealized models
representing different measuring processes converge to each other. This convergence
requirement is two-pronged. First, the assumptions with which models are constructed have
to cohere with each other and with background theory. Second, the consequences of
representing concrete processes under these assumptions must converge in accordance with
their associated uncertainties. When this dual-aspect convergence is shown to be sufficiently
robust under alterations to the instrument, sample and environment, all three problems are
solved simultaneously. That is, a robust convergence among models of multiple instruments
is sufficient to warrant claims about (i) whether the instruments measure the same quantity,
(ii) which quantity the instruments measure and (iii) how accurately each of them measures
this quantity. Of course, such knowledge claims are never warranted with complete certainty.
The ‘sufficiency’ of robustness tests may always be challenged by a new perturbation that
destroys convergence and forces metrologists to revise their models. As a result, some
second-order uncertainty is always present in the characterization of measurement
procedures.
Claims about coordination, accuracy and quantity individuation are contextual, i.e.
pertain to instruments only as they are represented by specified models. This context-
sensitivity is a consequence of recognizing the correct scope of knowledge claims made on
the basis of measurements. As I have shown, measurement outcomes are themselves
contextual and relative to the assumptions with which measurement processes are modelled.
Similarly, the notions of agreement, systematic error and measurement uncertainty all
become clear once their sensitivity to representational context is acknowledged. This,
however, does not mean that measurement outcomes lose their validity outside of the
laboratory where they were produced. On the contrary, the condition of convergence under
representations explains why measurement outcomes are able to ‘travel’ outside of the
context of their production and remain valid across a network of inter-calibrated
instruments. The fact that these instruments converge under their respective models ensures
that measurement outcomes produced by using one instrument would be reproducible
across the network, thereby securing the validity of measurement outcomes throughout the
network’s scope.
The model-based account has potentially important consequences for several ongoing
debates in the philosophy of science, consequences which are beyond the purview of this
thesis. One such consequence, already noted at the end of Chapter 4, is the centrality of
prediction to measurement, a discovery which calls for subtler accounts of the relationship
between theory and measurement. Another important consequence concerns the possibility
of a clear distinction between hypotheses and evidence. As we saw above, measurement
outcomes are inferred by projection from hypotheses about the measurement process. Just
like any other projective estimate, the validity of a measurement outcome depends on the
validity of underlying hypotheses. Hence the question arises whether and why measurement
outcomes are better suited to serve as evidence than other projective estimates, e.g. the
outputs of predictive computer simulations. Finally, the very idea that scientific
representation is a two-place relation – connecting abstract theories or models with concrete
objects and events – is significantly undermined by the model-based account. Under my
analysis, whether or not an idealized model adequately represents a measurement process is a
question whose answer is relative to the representational adequacy of other models with
respect to other measurement processes. Hence the model-based account implies a kind of
representational coherentism, i.e. a diffusion of representational adequacy conditions across
the entire web of instruments and knowledge claims. These implications of the model-based
account must nevertheless await elaboration elsewhere.