The Epistemology of Measurement:
A Model-Based Account
by
Eran Tal
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Philosophy
University of Toronto
© Copyright by Eran Tal 2012
The Epistemology of Measurement: A Model-Based Account
Eran Tal, Doctor of Philosophy
Department of Philosophy, University of Toronto, 2012
Thesis abstract
Measurement is an indispensable part of physical science as well as of commerce,
industry, and daily life. Measuring activities appear unproblematic when performed with
familiar instruments such as thermometers and clocks, but a closer examination reveals a
host of epistemological questions, including:
1. How is it possible to tell whether an instrument measures the quantity it is
intended to?
2. What do claims to measurement accuracy amount to, and how might such
claims be justified?
3. When is disagreement among instruments a sign of error, and when does it
imply that instruments measure different quantities?
Currently, these questions are almost completely ignored by philosophers of science,
who view them as methodological concerns to be settled by scientists. This dissertation
shows that these questions are not only philosophically worthy, but that their exploration
has the potential to challenge fundamental assumptions in philosophy of science, including
the distinction between measurement and prediction.
The thesis outlines a model-based epistemology of physical measurement and uses it to
address the questions above. To measure, I argue, is to estimate the value of a parameter in
an idealized model of a physical process. Such estimation involves inference from the final
state (‘indication’) of a process to the value range of a parameter (‘outcome’) in light of
theoretical and statistical assumptions. Idealizations are necessary preconditions for the
possibility of justifying such inferences. Similarly, claims to accuracy, error and quantity
individuation can only be adjudicated against the background of an idealized representation
of the measurement process.
Chapters 1-3 develop this framework and use it to analyze the inferential structure of
standardization procedures performed by contemporary standardization bureaus.
Standardizing time, for example, is a matter of constructing idealized models of multiple
atomic clocks in a way that allows consistent estimates of duration to be inferred from clock
indications. Chapter 4 shows that calibration is a special sort of modeling activity, i.e. the
activity of constructing and testing models of measurement processes. Contrary to
contemporary philosophical views, the accuracy of measurement outcomes is properly
evaluated by comparing model predictions to each other, rather than by comparing
observations.
Acknowledgements
In the course of writing this dissertation I have benefited time and again from the
knowledge, advice and support of teachers, colleagues and friends. I am deeply indebted to
Margie Morrison for being everything a supervisor should be and more: generous with her
time and precise in her feedback, unfailingly responsive and relentlessly committed to my
success. I thank Ian Hacking for his constant encouragement, for never ceasing to challenge
me, and for teaching me to respect the science and scientists of whom I write. I owe many
thanks to Anjan Chakravartty, who commented on several early proposals and many sketchy
drafts; this thesis owes its clarity to his meticulous feedback. My teaching mentor, Jim
Brown, has been a constant source of friendly advice on all academic matters since my very
first day in Toronto, for which I am very grateful.
In addition to my formal advisors, I have been fortunate enough to meet faculty members in
other institutions who have taken an active interest in my work. I am grateful to Stephan
Hartmann for the three wonderful months I spent as a visiting researcher at Tilburg
University; to Allan Franklin for ongoing feedback and assistance during my visit to the
University of Colorado; to Paul Teller for insightful and detailed comments on virtually the
entire dissertation; and to Marcel Boumans, Wendy Parker, Léna Soler, Alfred Nordmann
and Leah McClimans for informal mentorship and fruitful research collaborations.
Many other colleagues and friends provided useful comments on this thesis at various stages
of writing, of which I can only mention a few. I owe thanks to Giora Hon, Paul Humphreys,
Michela Massimi, Luca Mari, Carlo Martini, Ave Mets, Boaz Miller, Mary Morgan, Thomas
Müller, John Norton, Isaac Record, Jan Sprenger, Jacob Stegenga, Jonathan Weisberg,
Michael Weisberg, Eric Winsberg, and Jim Woodward, among many others.
I am especially thankful to Hasok Chang for writing a thoughtful and detailed appraisal of
this dissertation, and to Joseph Berkovitz and Denis Walsh for serving on my examination
committee.
The work presented here depended on numerous physicists who were kind enough to meet
with me, show me around their labs and answer my often naive questions. I am grateful to
members of the Time and Frequency Division at the US National Institute of Standards and
Technology (NIST) and JILA labs in Boulder, Colorado for their helpful cooperation. The
long hours I spent in conversation with Judah Levine introduced me to the fascinating world
of atomic clocks and ultimately gave rise to the central case studies reported in this thesis.
David Wineland’s invitation to visit the laboratories of the Ion Storage Group at NIST in
summer 2009 resulted in a wealth of materials for this dissertation. I am also indebted to
Eric Cornell, Till Rosenband, Scott Diddams, Tom Parker and Tom Heavner for their time
and patience in answering my questions. Special thanks go to Chris Ellenor and Rockson
Chang, who, as graduate students in Aephraim Steinberg’s laboratory in Toronto, spent
countless hours explaining to me the technicalities of Bose-Einstein Condensation.
My research for this dissertation was supported by several grants, including three Ontario
Graduate Scholarships, a Chancellor Jackman Graduate Fellowship in the Humanities, a
School of Graduate Studies Travel Grant (the latter two from the University of Toronto),
and a Junior Visiting Fellowship at Tilburg University.
I am indebted to Gideon Freudenthal, my MA thesis supervisor, whose enthusiasm for
teaching and attention to detail inspired me to pursue a career in philosophy.
My mother, Ruth Tal, has been extremely supportive and encouraging throughout my
graduate studies. I deeply thank her for enduring my infrequent visits home and the
occasional cold Toronto winter.
Finally, to my partner, Cheryl Dipede, for suffering through my long hours of study with
only support and love, and for obligingly jumping into the unknown with me, thanks for
being you.
Table of Contents
Introduction ... 1
    1. Measurement and knowledge ... 1
    2. The epistemology of measurement ... 3
    3. Three epistemological problems ... 5
        The problem of coordination ... 8
        The problem of accuracy ... 11
        The problem of quantity individuation ... 12
        Epistemic entanglement ... 14
    4. The challenge from practice ... 15
    5. The model-based account ... 17
    6. Methodology ... 21
    7. Plan of thesis ... 24
1. How Accurate is the Standard Second? ... 26
    1.1. Introduction ... 26
    1.2. Five notions of measurement accuracy ... 29
    1.3. The multiple realizability of unit definitions ... 33
    1.4. Uncertainty and de-idealization ... 37
    1.5. A robustness condition for accuracy ... 40
    1.6. Future definitions of the second ... 44
    1.7. Implications and conclusions ... 46
2. Systematic Error and the Problem of Quantity Individuation ... 48
    2.1. Introduction ... 48
    2.2. The problem of quantity individuation ... 51
        2.2.1. Agreement and error ... 51
        2.2.2. The model-relativity of systematic error ... 55
        2.2.3. Establishing agreement: a threefold condition ... 59
        2.2.4. Underdetermination ... 62
        2.2.5. Conceptual vs. practical consequences ... 64
    2.3. The shortcomings of foundationalism ... 67
        2.3.1. Bridgman’s operationalism ... 68
        2.3.2. Ellis’ conventionalism ... 70
        2.3.3. Representational Theory of Measurement ... 73
    2.4. A model-based account of measurement ... 78
        2.4.1. General outline ... 78
        2.4.2. Conceptual quantity individuation ... 83
        2.4.3. Practical quantity individuation ... 88
    2.5. Conclusion: error as a conceptual tool ... 91
3. Making Time: A Study in the Epistemology of Standardization ... 93
    3.1. Introduction ... 93
    3.2. Making time universal ... 99
        3.2.1. Stability and accuracy ... 99
        3.2.2. A plethora of clocks ... 103
        3.2.3. Bootstrapping reliability ... 106
        3.2.4. Divergent standards ... 108
        3.2.5. The leap second ... 111
    3.3. The two faces of stability ... 112
        3.3.1. An explanatory challenge ... 112
        3.3.2. Conventionalist explanations ... 113
        3.3.3. Constructivist explanations ... 118
    3.4. Models and coordination ... 123
        3.4.1. A third alternative ... 123
        3.4.2. Mediation, legislation, and models ... 126
        3.4.3. Coordinative freedom ... 130
    3.5. Conclusions ... 136
4. Calibration: Modeling the Measurement Process ... 138
    4.1. Introduction ... 138
    4.2. The products of calibration ... 142
        4.2.1. Metrological definition ... 142
        4.2.2. Indications vs. outcomes ... 143
        4.2.3. Forward and backward calibration functions ... 146
    4.3. Black-box calibration ... 148
    4.4. White-box calibration ... 151
        4.4.1. Model construction ... 151
        4.4.2. Uncertainty estimation ... 154
        4.4.3. Projection ... 158
        4.4.4. Predictability, not just correlation ... 160
    4.5. The role of standards in calibration ... 164
        4.5.1. Why standards? ... 164
        4.5.2. Two-way white-box calibration ... 165
        4.5.3. Calibration without metrological standards ... 168
        4.5.4. A global perspective ... 170
    4.6. From predictive uncertainty to measurement accuracy ... 174
    4.7. Conclusions ... 177
Epilogue ... 178
Bibliography ... 181
List of Tables
Table 1.1: Comparison of uncertainty budgets of aluminum (Al) and mercury (Hg) optical atomic clocks ... 45
Table 3.1: Excerpt from Circular-T ... 104
Table 4.1: Uncertainty budget for a torsion pendulum measurement of G, the Newtonian gravitational constant ... 156
Table 4.2: Type-B uncertainty budget for NIST-F1, the US primary frequency standard ... 166
List of Figures
Figure 3.1: A simplified hierarchy of approximations among model parameters in contemporary timekeeping ... 129
Figure 4.1: Modules and parameters involved in a white-box calibration of a simple caliper ... 153
Figure 4.2: A simplified diagram of a round-robin calibration scheme ... 169
Introduction
I often say that when you can measure what you are speaking about and express
it in numbers you know something about it; but when you cannot measure it,
when you cannot express it in numbers, your knowledge is of a meagre and
unsatisfactory kind […].
– William Thomson, Lord Kelvin (1891, 80)
1. Measurement and knowledge
Measurement is commonly seen as a privileged source of scientific knowledge. Unlike
qualitative observation, measurement enables the expression of empirical claims in
mathematical form and hence makes possible an exact description of nature. Lord Kelvin’s
famous remark expresses high esteem for measurement for this same reason. Today, in an
age when thermometers and ammeters produce stable measurement outcomes on familiar
scales, Kelvin’s remark may seem superfluous. How else could one gain reliable knowledge
of temperature and electric current other than through measurement? But the quantities
called ‘temperature’ and ‘current’ as well as the instruments that measure them have long
histories during which it was far from clear what was being measured and how – histories in
which Kelvin himself played important roles1.
These early struggles to find principled relations between the indications of material
instruments and values of abstract quantities illustrate the dual nature of measurement. On
the one hand, measurement involves the design, execution and observation of a concrete
physical process. On the other hand, the outcome of a measurement is a knowledge claim
formulated in terms of some abstract and universal concept – e.g. mass, current, length or
duration. How, and under what conditions, are such knowledge claims warranted on the
basis of material operations?
Answering this last question is crucial to understanding how measurement produces
knowledge. And yet contemporary philosophy of measurement offers little by way of an
answer. Epistemological concerns about measurement were briefly popular in the 1920s
(Campbell 1920, Bridgman 1927, Reichenbach [1927] 1958) and again in the 1960s (Carnap
[1966] 1995, Ellis 1966), but have otherwise remained in the background of philosophical
discussion. Until less than a decade ago, the philosophical literature on measurement focused
on either the metaphysics of quantities (Swoyer 1987, Michell 1994) or the mathematical
structure of measurement scales. The Representational Theory of Measurement (Krantz et al.
1971), for example, confined itself to a discussion of structural mappings between empirical
and quantitative domains and neglected the possibility of telling what, and how accurately,
such mappings measure. It is only in the last several years that a new wave of philosophical
writings about the epistemology of measurement has appeared (most notably Chang 2004,
Boumans 2006, 2007 and van Fraassen 2008, Ch. 5-7). Partly drawing on these recent
achievements, this thesis will offer a novel systematic account of the ways in which
measurement produces knowledge.
1 See Chang (2004, 173-186) and Gooday (2004, 2-9).
2. The epistemology of measurement
The epistemology of measurement, as envisioned in this dissertation, is a subfield of
philosophy concerned with the relationships between measurement and knowledge. Central
topics that fall under its purview are the conditions under which measurement produces
knowledge; the content, scope, justification and limits of such knowledge; the reasons why
particular methodologies of measurement and standardization succeed or fail in supporting
particular knowledge claims; and the relationships between measurement and other
knowledge-producing activities such as observation, theorizing, experimentation, modeling
and calculation. The pursuit of research into these topics is motivated not only by the need
to clarify the epistemic functions of measurement, but also by the prospects of contributing
to other areas of philosophical discussion concerning e.g. reliability, evidence, causality,
objectivity, representation and information.
As measurement is not exclusively a scientific activity – it plays vital roles in
engineering, medicine, commerce, public policy and everyday life – the epistemology of
measurement is not simply a specialized branch of philosophy of science. Instead, the
epistemology of measurement is a subfield of philosophy that draws on the tools and
concepts of traditional epistemology, philosophy of science, philosophy of language,
philosophy of technology and philosophy of mind, among other subfields. It is also a
multidisciplinary subfield, ultimately engaging with measurement techniques from a variety of
disciplines as well as with the histories and sociologies of those disciplines.
The goal of providing a comprehensive epistemological theory of measurement is
beyond the scope of a single doctoral dissertation. This thesis is cautiously titled ‘account’
rather than ‘theory’ in order to signal a more modest intention: to argue for the plausibility
of a particular approach to the epistemology of measurement by demonstrating its strengths
in a specific domain. I call my approach ‘model-based’ because it tackles epistemological
challenges by appealing to abstract and idealized models of measurement processes. As I will
explain below, this thesis constitutes the first systematic attempt to bring insights from the
burgeoning literature on the philosophy of scientific modeling to bear on traditional
problems in the philosophy of measurement. The specific domain I will focus on is physical
metrology, officially defined as “the science of measurement and its application”2. Metrologists
are the physicists and engineers who design and standardize measuring instruments for use
in scientific and commercial applications, and often work at standardization bureaus or
specially accredited laboratories.
The immediate aim of this dissertation, then, is to show that a model-based approach
to measurement successfully solves certain epistemological challenges in the domain of
physical metrology. By achieving this aim, a more far-reaching goal will also be
accomplished, namely, a demonstration of the importance of research into the epistemology
of measurement and of the promise held by model-based approaches for further research in
this area.
2 JCGM 2008, 2.2.
The epistemological challenges addressed in this thesis may be divided into two kinds.
The first kind consists of abstract and general epistemological problems that pertain to any
sort of measurement, whether physical or nonphysical (e.g. of social or mental quantities). I
will address three such problems: the problem of coordination, the problem of accuracy, and
the problem of quantity individuation. These problems will be introduced in the next
section. The second kind of epistemological challenge consists of problems that are specific
to physical metrology. These problems arise from the need to explain the efficacy of
metrological methods for solving problems of the first sort – for example, the efficacy of
metrological uncertainty evaluations in overcoming the problem of accuracy. After
discussing these ‘challenges from practice’, I will introduce the model-based account,
explicate my methodology and outline the plan of this thesis.
3. Three epistemological problems
This thesis will address three general epistemological problems related to
measurement, which arise when one attempts to answer the following three questions:
1. Given a procedure P and a quantity Q, how is it possible to tell whether P
measures Q?
2. Assuming that procedure P measures quantity Q, how is it possible to tell how
accurately P measures Q?
3. Assuming that P and P′ are two measuring procedures, how is it possible to
tell whether P and P′ measure the same quantity?
Each of these three questions pertains to the possibility of obtaining knowledge of
some sort about the relationship between measuring procedures and the quantities they
measure. The sort of possibility I am interested in is not a general metaphysical or epistemic
one – I do not consider the existence of the world or the veridical character of perception as
relevant answers to the questions above. Rather, I will be interested in possibility in the
practical, technological sense. What is technologically possible is what humans can do with
the limited cognitive and material resources they have at their disposal and within reasonable
time3. Hence to qualify as an adequate answer to the questions above, a condition of
possibility must be cognitively accessible through one or more empirical tests that humans may
reasonably be expected to perform. For example, an adequate answer to the first question
would specify the sort of evidence scientists are required to collect in order to test whether
an instrument is a thermometer – i.e. whether or not it measures temperature – as well as
general considerations that apply to the analysis of this evidence.
An obvious worry is that such conditions are too specific and can only be supplied on
a case-by-case basis. This worry would no doubt be justified if one were to seek particular
test specifications or ‘experimental recipes’ in response to the questions above. No single
test, nor even a small set of tests, exists that can be applied universally to any measuring
procedure and any quantity to yield satisfactory answers to the questions above. But this
worry is founded on an overly narrow interpretation of the questions’ scope. The conditions
of possibility sought by the questions above are not empirical test specifications but only
general formal constraints on such specifications. These formal constraints, as we shall see,
pertain to the structure of inferences involved in such tests and to general representational
preconditions for performing them. Of course, it is not guaranteed in advance that even
general constraints of this sort exist. If they do not, knowledge claims about measurement,
accuracy and quantity individuation would have no unifying grounds. Yet at least in the case
of physical quantities, I will show that a shared inferential and representational structure
indeed underlies the possibility of knowing what, and how accurately, one is measuring.
3 For an elaboration of the notion of technological possibility see Record (2011, Ch. 2).
Another, sceptical sort of worry is that the questions above may have no answer at all,
because it may in fact be impossible to know whether and how accurately any given
procedure measures any quantity. I take this worry to be indicative of a failure in
philosophical methodology rather than an expression of a cautious approach to the
limitations of human knowledge. The terms “measurement”, “quantity” and “accuracy”
already have stable (though not necessarily unique) meanings set by their usage in scientific
practice. Claims to measurement, accuracy and quantity individuation are commonly made in
the sciences based on these stable meanings. The job of epistemologists of measurement, as
envisioned in this thesis, is to clarify these meanings and make sense of scientific claims
made in light of such meanings. In some cases the epistemologist may conclude that a
particular scientific claim is unfounded or that a particular scientific method is unreliable.
But the conclusion that all claims to measurement are unfounded is only possible if
philosophers create perverse new meanings for these terms. For example, the idea that
measurement accuracy is unknowable in principle cannot be seriously entertained unless the
meaning of “accuracy” is detached from the way practicing metrologists use this term, as will
be shown in Chapter 1. I will elaborate further on the interplay between descriptive and
normative aspects of the epistemology of measurement when I discuss my methodology
below.
As mentioned, the attempt to answer the three questions above gives rise to three
epistemological problems: the problem of coordination, the problem of accuracy and the
problem of quantity individuation, respectively. The next three subsections will introduce
these problems, and the fourth subsection will discuss their mutual entanglement.
The problem of coordination
How can one tell whether a given empirical procedure measures a given quantity? For
example, how can one tell that an instrument is a thermometer, i.e. that the procedure of its
use results in estimates of temperature? The answer is clear enough if one is allowed to
presuppose, as scientists do today, an accepted theory of temperature along with accepted
standards for measuring temperature. The epistemological conundrum arises when one
attempts to explain the possibility of establishing such theories and standards in the first
place. To establish a theory of temperature one has to be able to test its predictions
empirically, a task which requires a reliable method of measuring temperature; but
establishing such a method requires prior knowledge of how temperature is related to other
quantities, e.g. volume or pressure, and this can only be settled by an empirically tested
theory of temperature. It appears to be impossible to coordinate the abstract notion of
temperature to any concrete method of measuring temperature without begging the
question.
The problem of coordination was discussed by Mach ([1896] 1966) in his analysis of
temperature measurement and by Poincaré ([1898] 1958) in relation to the measurement of
space and time. Both authors took the view that the choice of coordinative principles is
arbitrary and motivated by considerations of simplicity. Which substance is taken to expand
uniformly with temperature, and which kind of clock is taken to ‘tick’ at equal time intervals,
are choices based on convenience rather than observation. The conventionalist solution was
later generalized by Reichenbach ([1927] 1958), Carnap ([1966] 1995) and Ellis (1966), who
understood such coordinative principles (or ‘correspondence rules’) as a priori definitions
that are in no need of empirical verification. Rather than statements of fact, such principles
of coordination were viewed as semantic preconditions for the possibility of measurement.
However, conventionalists maintained that, unlike ‘ordinary’ conceptual definitions,
coordinative definitions do not fully determine the meaning of a quantity concept but only
regulate its use. For example, what counts as an accurate measurement of time depends on
which type of clock is chosen to regulate the application of the notion of temporal
uniformity. But the extension of the notion of uniformity is not limited to that particular
type of clock. Other types of clock may be used to measure time, and their accuracy is
evaluated by empirical comparison to the conventionally chosen standard4.
4 See, for example, Carnap on the periodicity of clocks ([1966] 1995, 84). For a discussion of the differences between operationalism and conventionalism see Chang and Cartwright (2008, 368).
Another approach to the problem of coordination, closely aligned with but distinct
from conventionalism, was defended by Bridgman (1927). Bridgman’s initial proposal was to
define a quantity concept directly by the operation of its measurement, so that strictly
speaking two different types of operation necessarily measure different quantities. The
operationalist solution is more radical than conventionalism, as it reduces the meaning of a
quantity concept to its operational definition. Bridgman motivated this approach by the need
to exercise caution when applying what appears to be the same quantity concept across
different domains. Bridgman later modified his view in response to various criticisms and no
longer viewed operationalism as a comprehensive theory of meaning (Bridgman 1959,
Chang 2009, 2.1).
A new strand of writing on the problem of coordination has emerged in the last
decade, consisting most notably of the works of Chang (2004) and van Fraassen (2008, Ch.
5). These works take a historical-contextual and coherentist approach to the problem. Rather
than attempt a solution from first principles, these writers appeal to considerations of
coherence and consistency among different elements of scientific practice. The process of
theory-construction and standardization is seen as mutual and iterative, with each iteration
respecting existing traditions while at the same time correcting them. At each such iteration
the quantity concept is re-coordinated to a more robust set of standards, which in turn
allows theoretical predictions to be tested more accurately, etc. The challenge for these
writers is not to find a vantage point from which coordination is deemed rational a priori,
but to trace the inferential and material apparatuses responsible for the mutual refinement of
theory and measurement in any specific case. Hence they reject the traditional question:
‘what is the general solution to the problem of coordination?’ in favour of historically
situated, local investigations.
As will become clear, my approach to the problem of coordination continues the
historical-contextual and coherentist trend in recent scholarship, but at the same time seeks
to specify general formal features common to successful solutions to this problem. Rather
than abandon traditional approaches to the problem altogether, my aim will be to shed new
light on, and ultimately improve upon, conventionalist and operationalist attempts to solve
the problem of coordination. To this end I will provide a novel account of what it means to
coordinate quantity concepts to physical operations – an account in which coordination is
understood as a process rather than a static definition – and clarify the conventional and
empirical aspects of this process.
The problem of accuracy
Even if one can safely assume that a given procedure measures the quantity it is
intended to, a second problem arises when one tries to evaluate the accuracy of that
procedure. Quantities such as length, duration and temperature, insofar as they are
represented by non-integer (e.g. rational or real) numbers, cannot be measured with
complete accuracy. Even measurements of integer-valued quantities, such as the number of
alpha particles emitted in radioactive decay, often involve uncertainties. The accuracy
of measurements of such quantities cannot, therefore, be evaluated by reference to exact
values but only by comparing uncertain estimates to each other. Such comparisons by their
very nature cannot determine the extent of error associated with any single estimate but only
overall mutual compatibility among estimates. Hence multiple ways of distributing errors
among estimates are possible that are all consistent with the evidence gathered through
comparisons. It seems that claims to accuracy are intrinsically underdetermined by any
possible evidence.5
Many of the authors who have discussed the problem of coordination appear to have
also identified the problem of accuracy, although they have not always distinguished the two
very clearly. Often, as in the cases of Mach, Ellis and Carnap, they naively believed that
fixing a measurement standard in an arbitrary manner is sufficient to solve both problems at
once. However, measurement standards are physical instruments whose construction,
maintenance, operation and comparison suffer from uncertainties just like those of other
instruments. As I will show, the absolute accuracy of measurement standards is nothing but
a myth that obscures the complexity behind the problem of accuracy. Indeed, I will argue
that the role played by standards in the evaluation of measurement accuracy has so far been
grossly misunderstood by philosophers. Once the epistemic role of standards is clarified,
new and important insights emerge not only with respect to the proper solution to the
problem of accuracy but also with respect to the other two problems.
The problem of quantity individuation
When discussing the previous two problems I implicitly assumed that it is possible to
tell whether multiple measuring procedures, compared to each other either synchronically or
diachronically, measure the same quantity. But this assumption quickly leads to another
underdetermination problem, which I call the ‘problem of quantity individuation.’ Even
5 See also Kyburg (1984, 183).
when two different procedures are thought to measure the same quantity, their outcomes
rarely exactly coincide under similar conditions. Therefore when the outcomes of two
procedures appear to disagree with each other two kinds of explanation are open to
scientists: either one (or both) of the procedures are inaccurate, or the two procedures
measure different quantities.6 But any empirical test that may be brought to bear on this
dilemma necessarily presupposes additional facts about agreement or disagreement among
measurement outcomes and merely duplicates the problem. Much like claims about
accuracy, claims about quantity individuation are underdetermined by any possible evidence.
As Chapter 2 will make clear, existing philosophical accounts of quantity individuation
do not fully acknowledge the import of the problem. Bridgman and Ellis, for example, both
acknowledge that claims to quantity individuation are underdetermined by facts about
agreement and disagreement among measuring instruments. And yet they fail to notice that
facts about agreement and disagreement among measuring instruments are themselves
underdetermined by the indications of those instruments. Once this additional level of
underdetermination is properly appreciated, Bridgman and Ellis’ proposed criteria of
quantity individuation are exposed as question-begging. A proper solution to the problem of
quantity individuation, I will argue, is possible only if one takes into account its
entanglement with the first two problems.
6 This second option may be further subdivided into sub-options. The two procedures may be measuring different quantity tokens of the same type, e.g. lengths of different objects, or two different types of quantity altogether, e.g. length and area.
Epistemic entanglement
Though conceptually distinct, I will argue that the three problems just mentioned are
epistemically entangled, i.e. that they cannot be solved independently of one another.
Specifically, I will show that (i) it is impossible to test whether a given procedure P measures
a given quantity Q without at the same time testing how accurately procedure P would
measure quantity Q; (ii) it is impossible to test how accurately procedure P would measure
quantity Q without comparing it to some other procedure P′ that is supposed to measure Q;
and (iii) it is impossible to test whether P and P′ measure the same quantity without at the
same time testing whether they measure some given quantity, e.g. Q. Note that these
‘impossibility theses’ are epistemic rather than logical. For example, it is logically possible to
know that two procedures measure the same quantity without knowing which quantity they
measure.7 Nevertheless, it is epistemically impossible to test whether two procedures
measure the same quantity without making substantive assumptions about the quantity they
are supposed to measure.
The extent and consequences of this epistemic entanglement have hitherto remained
unrecognized by philosophers, despite the fact that some of the problems themselves have
been widely acknowledged for a long time. The model-based account presented here is the
first epistemology of measurement to clarify how it is possible in general to solve all three
problems simultaneously without getting caught in a vicious circle.
7 The opposite is not the case, of course: one cannot (logically speaking) know which quantities two procedures measure without knowing whether they measure the same quantity. Questions 1 and 3 are therefore logically related, but not logically equivalent.
4. The challenge from practice
Apart from solving abstract and general problems like those discussed in the previous
section, a central challenge for the epistemology of measurement is to make sense of specific
measurement methods employed in particular disciplines. Indeed, it would be of little value
to suggest a solution to the abstract problems that has no bearing on scientific practice, as
such a solution would not be able to clarify whether and how accepted measurement methods
actually overcome these problems. The ‘challenge from practice’, then, is to shed light on the
epistemic efficacy of concrete methodologies of measurement and standardization. How do
such methods overcome the three general epistemological problems discussed above? As
already mentioned, this thesis will focus on the standardization of physical measuring
instruments. Physical metrology involves a variety of methods for instrument comparison,
error detection and correction, uncertainty evaluation and calibration. These methods
employ theoretical and statistical tools as well as techniques of experimental manipulation
and control. A central desideratum for the plausibility of the model-based account will be its
ability to explain how, and under what conditions, these methods support knowledge claims
about coordination, accuracy and quantity individuation.
As my focus will be on physical metrology, I will pay special attention to the
methodological guidelines developed by practitioners in that field. In particular, I will
frequently refer to two documents published in 2008 by the Joint Committee for Guides in
Metrology (JCGM), a committee that represents eight leading international standardization
bodies8. The first document is the International Vocabulary of Metrology – Basic and General
Concepts and Associated Terms (VIM), 3rd edition (JCGM 2008).9 This document contains
definitions and clarificatory remarks for dozens of key concepts in metrology such as
calibration, measurement accuracy, measurement precision and measurement standard.
These definitions shed light on the way practitioners understand these concepts and on their
underlying (and sometimes conflicting) epistemic and metaphysical commitments. The
second document is titled Evaluation of Measurement Data — Guide to the Expression of
Uncertainty in Measurement (GUM), 1st edition (JCGM 2008a). This document provides
detailed guidelines for evaluating measurement uncertainties and for comparing the results
of different measurements. Together these two documents portray a methodological picture
of metrology in which abstract and idealized representations of measurement processes play
a central role. However, being geared towards regulating practice, these documents do not
explicitly analyze the presuppositions underlying this methodological picture nor its efficacy
for overcoming general epistemological conundrums that are of interest to philosophers. It
is this gap between methodology and epistemology that the model-based account of
measurement is intended to fill.
8 The JCGM is composed of representatives from the International Bureau of Weights and Measures (BIPM), the International Electrotechnical Commission (IEC), the International Federation of Clinical Chemistry and Laboratory Medicine (IFCC), the International Laboratory Accreditation Cooperation (ILAC), the International Organization for Standardization (ISO), the International Union of Pure and Applied Chemistry (IUPAC), the International Union of Pure and Applied Physics (IUPAP) and the International Organization of Legal Metrology (OIML).
9 A new version of the 3rd edition of the VIM with minor changes was published in early 2012. My discussion in this thesis applies equally to this new version.
5. The model-based account
According to the model-based account, a necessary precondition for the possibility of
measuring is the specification of an abstract and idealized model of the measurement process. To
measure a physical quantity is to make coherent and consistent inferences from the final
state(s) of a physical process to value(s) of a parameter in the model. Prior to the
subsumption of a process under some idealized assumptions, it is impossible to ground such
inferences and hence impossible to obtain a measurement outcome. Rather than be given by
observation, measurement outcomes are sensitive to the assumptions with which a
measurement process is modelled and may change when these assumptions change. The
same holds true for estimates of measurement uncertainty, accuracy and error, as well as for
judgements about agreement and disagreement among measurement outcomes – all are
relative to the assumptions under which the relevant measurement processes are modelled.
My conception of the nature and functions of models follows closely the views
expressed in Morrison and Morgan (1999), Morrison (1999), Cartwright et al. (1995) and
Cartwright (1999). I take a scientific model to be an abstract representation of some local
phenomenon, a representation that is used to predict and explain aspects of that
phenomenon. A model is constructed out of assumptions about the ‘target’ phenomenon
being represented. These assumptions may include laws and principles from one or more
theories, empirical generalizations from available data, statistical assumptions about the data,
and other local (and sometimes ad hoc) simplifying assumptions about the phenomenon of
interest. The specialized character of models allows them to function autonomously from
the theories that contributed to their construction, and to mediate between the highly
abstract assumptions of theory and concrete phenomena. I view models as instruments that
are more or less useful for purposes of prediction, explanation, experimental design and
intervention, rather than as descriptions that are true or false.
Though not committed to any particular view on how models represent the world, the
model-based account does not require models to mirror the structure of their target systems
in order to be successful representational instruments. My framework therefore differs from
the ‘semantic’ view, which takes models to be set-theoretical relational structures that are
isomorphic to relations among objects in the target domain (Suppes 1960, van Fraassen
1980, 41-6). The model-based account is also permissive with respect to the ontology of
models, and apart from assuming that models are abstract constructs I do not presuppose
any particular view concerning their nature (e.g. abstract entities, mathematical objects,
fictions). I do, however, take models to be non-linguistic entities and hence different from
the equations used to express their assumptions and consequences.10
The epistemic functions of models have received far less attention in the context of
measurement than in other contexts where models are used to produce knowledge, e.g.
theory construction, prediction, explanation, experimentation and simulation. An exception
to this general neglect is the use of models for measurement in economics, a topic about
which philosophers have gained valuable insights in recent years (Boumans 2005, 2006,
2007; Morgan 2007). The Representational Theory of Measurement (Krantz et al 1971)
appeals to models in the set-theoretical sense to elucidate the adequacy of different types of
scales, but completely neglects epistemic questions concerning coordination, accuracy and
quantity individuation. This thesis will focus on the epistemic functions of models in
10 On this last point my terminology is at odds with that of the VIM, which defines a measurement model as a set of equations. Cf. JCGM 2008, 2.48 “Measurement Model”, p. 32.
physical measurement, a topic on which relatively little has been written, and to date no
systematic account has been offered.11
The models I will discuss represent measurement processes. Such processes have physical
and symbolic aspects. The physical aspect of a measurement process, broadly construed,
includes interactions between a measuring instrument, one or more measured samples, the
environment and human operators. The symbolic aspect includes data processing operations
such as averaging, data reduction and error correction. The primary function of models of
measurement processes is to represent the final states – or ‘indications’ – of the process in
terms of values of the measured quantity. For example, the primary function of a model of a
cesium fountain clock is to represent the output frequency of the clock (the frequency of its
‘ticks’) in terms of the ideal frequency associated with a specific hyperfine transition in
cesium-133. To do this, the model of the clock must incorporate theoretical and statistical
assumptions about the working of the clock and its interactions with the cesium sample and
the environment, as well as about the processing of the output frequency signal.
A measurement procedure is a measurement process as represented under a particular set
of modeling assumptions. Hence multiple procedures may be instantiated on the basis of the
same measurement process when the latter is represented with different models.12 For
example, the same interactions among various parts of a cesium fountain clock and its
11 But see important contributions to this topic by Morrison (2009) and Frigerio, Giordani and Mari (2010).
12 Here too I have chosen to slightly deviate from the terminology of the VIM, which defines a measurement procedure as a description of a measurement process that is based on a measurement model (JCGM 2008, 2.6, p. 18). I use the term, by contrast, to denote a measurement process as represented by a measurement model. The difference is that in the VIM definition a procedure does not itself measure but only provides instructions on how to measure, whereas in my definition a procedure measures.
environment may instantiate different procedures for measuring time when modelled with
different assumptions.
According to the model-based account, knowledge claims about coordination,
accuracy and quantity individuation are properly ascribable to measurement procedures rather
than to measurement processes. That is, such knowledge claims presuppose that the
measurement process in question is already subsumed under specific idealized assumptions,
and may therefore be judged as true or false only relative to those assumptions. The central
reason for this model-relativity is that prior to the subsumption of a measurement process
under a model it is impossible to warrant objective claims about the outcomes of
measurement, that is, claims that reasonably ascribe the outcome to the object being measured
rather than to idiosyncrasies of the procedure. This will be explained in detail in Chapter 2.
As I will argue, the model-based account meets both the abstract and practice-based
challenges I have discussed. Once the inferential grounds of measurement claims are
relativized to a representational context, it becomes clear how all three epistemic problems
mentioned above may be solved simultaneously. Moreover, it becomes clear how
contemporary metrological methods of standardization, calibration and uncertainty
evaluation are able to solve these problems, and what practical considerations and trade-offs
are involved in the application of such methods. Finally, it becomes clear why measurement
outcomes retain their objective validity outside the representational context in which they are
obtained, thereby avoiding problems of incommensurability across different measuring
procedures.
In providing a model-based epistemology of measurement, I intend to offer neither a
critique nor an endorsement of metaphysical realism with respect to measurable quantities.
My account remains agnostic with respect to metaphysics and pertains to measurement
solely as an epistemic activity, i.e. to the inferences and assumptions that make it possible to
warrant knowledge claims by operating measuring instruments. For example, nothing in my
account depends on whether or not ratios of mass (or length, or duration) exist mind-
independently. Indeed, in Chapters 1 and 4 I show that the problem of accuracy is solved in
exactly the same way regardless of whether one interprets measurement uncertainties as
deviations from true quantity values or as estimates of the degree of mutual consistency
among the consequences of different models. The question of realism with respect to
measurable quantities is therefore independent of the epistemology of measurement and
underdetermined by any evidence one can gather from the practice of measuring.
6. Methodology
As I have mentioned, the model-based account is designed to meet both general
epistemological challenges and challenges from practice. These two sorts of challenge may
be distinguished along the lines of a normative-descriptive divide and formulated as two
questions:
1. Normative question: what are some of the formal desiderata for an adequate
solution to the problems of coordination, accuracy and quantity
individuation?
2. Descriptive question: do the methods employed in physical metrology satisfy
these desiderata?
It is tempting to try to answer these questions separately – first by analyzing the
abstract problems and arriving at formal desiderata for their solution, and then by surveying
metrological methods for compatibility with these desiderata. But on a closer look it
becomes clear that these two questions cannot be answered completely independently of
each other. Much like the first-order problems, these questions are entangled. On the one
hand, overly strict normative desiderata would lead to the absurdity that no method can
resolve the problems (why this is an absurdity was discussed above). An example of an
overly strict desideratum is the requirement that measurement processes be perfectly
repeatable, a demand that is unattainable in practice. On the other hand, overly lenient
desiderata would run the risk of vindicating methods that practitioners regard as flawed.
Though not necessarily absurd, such cases, if they abounded, would eventually raise the worry
that one’s normative account fails to capture the problems that practitioners are trying to
solve. To avoid these two extremes, the epistemologist must be able to learn from practice what
counts as a good solution to an epistemological problem, yet do so without relinquishing the
normativity of her account.
These seemingly conflicting needs are fulfilled by a method I call ‘normative analysis of
exemplary cases’. I provide original and detailed case studies of metrological methods that
practitioners consider exemplary solutions to the general epistemological problems posed
above. Being exemplary solutions, they must also come out as successful solutions in my own
epistemological account, for otherwise I have failed to capture the problems that
metrologists are trying to solve. Note that this is not a license to believe everything
practitioners say, but merely a reasonable starting point for a normative analysis of practice.
In other words, this method reflects a commitment to learn from practitioners what their
problems are and assess their success in solving these problems rather than the preconceived
problems of philosophers.
For my main case studies I have chosen to concentrate on the standardization of time
and frequency, the most accurately and stably realized physical quantities in contemporary
metrology. In addition to a study of the metrological literature, I spent several weeks at the
laboratories of the Time and Frequency Division at the US National Institute of Standards
and Technology (NIST) in Boulder, Colorado. I conducted interviews with ten of the
Division’s scientists as well as with several other specialists at the University of Colorado’s
JILA labs. In these interviews I invited metrologists to reflect on the reasons why they make
certain knowledge claims about atomic clocks (e.g. about their accuracy, errors and
agreement), on the methods they use to validate such claims, and on problems or limitations
they encounter in applying these methods.
These materials then served as the basis for abstracting common presuppositions and
inference patterns that characterize metrological methods more generally. At the same time,
my superficial ‘enculturation’ into metrological life allowed me to reconceptualise the general
epistemological problems and assess their relevance to the concrete challenges of the
laboratory. These ongoing iterations of abstraction and concretization eventually led to a
stable set of desiderata that fit the exemplars and at the same time were general enough to
extend beyond them.
7. Plan of thesis
This dissertation consists of four autonomous essays, each dedicated to a different
aspect of the epistemic and methodological challenges mentioned above. Rather than
advance a single argument, each essay contains a self-standing argument in favour of the
model-based account from different but interlocking perspectives.
Chapter 1 is dedicated to primary measurement standards, and debunks the myth
according to which such standards are perfectly accurate. I clarify how the uncertainty
associated with primary standards is evaluated and how the subsumption of standards under
idealized models justifies inferences from uncertainty to accuracy.
Chapter 2 introduces the problem of quantity individuation, and shows that this
problem cannot be solved independently of the problems of coordination and accuracy. The
model-based account is then presented and shown to dissolve all three problems at once.
Chapter 3 expands on the problem of coordination through a discussion of the
construction and maintenance of Coordinated Universal Time (UTC). As I argue, abstract
quantity concepts such as terrestrial time are not coordinated directly to any concrete clock,
but only indirectly through a hierarchy of models. This mediation explains how seemingly ad
hoc error corrections can stabilize the way an abstract quantity concept is applied to
particulars.
Finally, Chapter 4 extends the scope of discussion from standards to measurement
procedures in general by focusing on calibration. I show that calibration is a special sort of
modeling activity, and that measurement uncertainty is a special sort of predictive
uncertainty associated with this activity. The role of standards in calibration is clarified and a
general solution to the problem of accuracy is provided in terms of a robustness test among
predictions of multiple models.
The Epistemology of Measurement: A Model-Based Account
1. How Accurate is the Standard Second?
Abstract: Contrary to the claim that measurement standards are absolutely accurate by definition, I argue that unit definitions do not completely fix the referents of unit terms. Instead, idealized models play a crucial semantic role in coordinating the theoretical definition of a unit with its multiple concrete realizations. The accuracy of realizations is evaluated by comparing them to each other in light of their respective models. The epistemic credentials of this method are examined and illustrated through an analysis of the contemporary standardization of time. I distinguish among five senses of ‘measurement accuracy’ and clarify how idealizations enable the assessment of accuracy in each sense.13
1.1. Introduction
A common philosophical myth states that the meter bar in Paris is exactly one meter
long – that is, if any determinate length can be ascribed to it in the metric system. One
variant of the myth comes from Wittgenstein, who tells us that the meter bar is the one
thing “of which one can say neither that it is one metre long, nor that it is not one metre
long” (1953 §50). Kripke famously disagrees, but develops a variant of the same myth by
13 This chapter was published with minor modifications as Tal (2011).
stating that the length of the bar at a specified time is rigidly designated by the phrase ‘one
meter’ (1980, 56). Neither of these pronouncements is easily reconciled with the 1960
declaration of the General Conference on Weights and Measures, according to which “the
international Prototype does not define the metre with an accuracy adequate for the present
needs of metrology” and is therefore replaced by an atomic standard (CGPM 1961). There
is, of course, nothing problematic with replacing one definition with another. But how can
the accuracy of the meter bar be evaluated against anything other than itself, let alone be
found lacking?
Wittgenstein and Kripke almost certainly did not subscribe to the myth they helped
disseminate. There are good reasons to believe that their examples were meant merely as
hypothetical illustrations of their views on meaning and reference.14 This chapter does not
take issue with their accounts of language, but with the myth of the absolute accuracy of
measurement standards, which has remained unchallenged by philosophers of science. The
meter is not the only case where the myth clashes with scientific practice. The second and
the kilogram, which are currently used to define all other units in the International System
(i.e. the ‘metric’ system), are associated with primary standards that undergo routine accuracy
evaluations and are occasionally improved or replaced with more accurate ones. In the case
of the second, for example, the accuracy of primary standards has increased more than a
thousand-fold over the past four decades (Lombardi et al 2007).
This chapter will analyze the methodology of these evaluations, and argue that they
indeed provide estimates of accuracy in the same senses of ‘accuracy’ normally presupposed
14 Wittgenstein mentions the meter bar only in passing as an analogy to color language-games. Kripke carefully notes that the uniqueness of the meter bar’s role in standardizing length is no more than a hypothetical supposition (1980, 55, fn. 20).
by scientific and philosophical discussions of measurement. My main examples will come
from the standardization of time. I will focus on the methods by which time and frequency
standards are evaluated and improved at the US National Institute of Standards and
Technology (NIST). These tasks are carried out by metrologists, experts in highly reliable
measurement. The methods and tools of metrology – a live discipline with its own journals
and controversies – have received little attention from philosophers.15 Recent philosophical
literature on measurement has mostly been concerned either with the metaphysics of
quantity and number (Swoyer 1987, Michell 1994) or with the mathematical structures
underlying measurement scales (Krantz et al. 1971). These ‘abstract’ approaches treat the
topics of uncertainty, accuracy and error as extrinsic to the theory of measurement and as
arising merely from imperfections in its application. Though they do not deny that
measurement operations involve interactions with imperfect instruments in noisy
environments, authors in this tradition analyze measurement operations as if these
imperfections have already been corrected or controlled for.
By contrast, the current study is meant as a step towards a practice-oriented
epistemology of physical measurement. The management of uncertainty and error will be
viewed as intrinsic to measurement and as a precondition for the possibility of gaining
knowledge from the operation of measuring instruments. At the heart of this view lies the
recognition that a theory of measurement cannot be neatly separated into fundamental and
applied parts. The methods employed in practice to correct errors and evaluate uncertainties
crucially influence which answers are given to so-called ‘fundamental’ questions about
15 Notable exceptions are Chang (2004) and Boumans (2007). Metrology has been studied by historians and sociologists of science, e.g. Latour (1987, ch. 6), Schaffer (1992), Galison (2003) and Gooday (2004).
quantity individuation and the appropriateness of measurement scales. This will be argued in
detail in Chapter 2.
In this chapter I will use insights into metrological practices to outline a novel account
of the underexplored relationship between uncertainty and accuracy. Scientists often include
uncertainty estimates in their reports of measurement results, but whether such estimates
warrant claims about the accuracy of results is an epistemological question that philosophers
have overlooked. Based on an analysis of time standardization, I will argue that inferences
from uncertainty to accuracy are justified when a doubly robust fit – among instruments as
well as among idealized models of these instruments – is demonstrated. My account will
shed light on metrologists’ claims that the accuracy of standards is being continually
improved and on the role played by idealized models in these improvements.
1.2. Five notions of measurement accuracy
Accuracy is often ascribed to scientific theories, instruments, models, calculations and
data, although the meaning of the term varies greatly with context. Even within the limited
context of physical measurement the term carries multiple senses. For the sake of the
current discussion I offer a preliminary distinction among five notions of measurement
accuracy. These are intended to capture different senses of the term as it is used by
physicists and engineers as well as by philosophers of science. The five notions are neither
co-extensive nor mutually exclusive but instead partially overlap in their extensions. As I will
argue below, the sort of robustness test metrologists employ to evaluate the uncertainty of
measurement standards provides sufficient evidence for the accuracy of those standards
under all five senses of ‘measurement accuracy’.
1. Metaphysical measurement accuracy: closeness of agreement between a
measured value of a quantity and its true value16
(correlate concept: truth)
2. Epistemic measurement accuracy: closeness of agreement among values
reasonably attributed to a quantity based on its measurement17
(correlate concept: uncertainty)
3. Operational measurement accuracy: closeness of agreement between a
measured value of a quantity and a value of that quantity obtained by
reference to a measurement standard
(correlate concept: standardization)
4. Comparative measurement accuracy: closeness of agreement among
measured values of a quantity obtained by using different measuring systems,
or by varying extraneous conditions in a controlled manner
(correlate concept: reproducibility)
16 cf. “Measurement Accuracy” in the International Vocabulary of Metrology (VIM) (JCGM 2008, 2.13). My own definitions for measurement-related terms are inspired by, but in some cases diverge from, those of the VIM.
17 cf. JCGM 2008, 2.13, Note 3.
5. Pragmatic measurement accuracy (‘accuracy for’): measurement accuracy
(in any of the above four senses) sufficient for meeting the requirements of a
specified application.
Let us briefly clarify each of these five notions. First, the metaphysical notion takes
truth to be the standard of accuracy. For example, a thermometer is metaphysically accurate
if its outcomes are close to true ratios between measured temperature intervals and the
chosen unit interval. If one assumes a traditional understanding of truth as correspondence
with a mind-independent reality, the notion of metaphysical accuracy presupposes some
form of realism about quantities. The argument advanced in this chapter is nevertheless
independent of such realist assumptions, as it neither endorses nor rejects metaphysical
conceptions of measurement accuracy.
Second, a thermometer is epistemically accurate if its design and use warrant the
attribution of a narrow range of temperature values to objects. The dispersion of reasonably
attributed values is called measurement uncertainty and is commonly expressed as a value
range18. Epistemic accuracy should not be confused with precision, which constitutes only one
aspect of epistemic accuracy. Measurement precision is the closeness of agreement among
measured values obtained by repeated measurements of the same (or relevantly similar)
18 cf. “Measurement Uncertainty” (JCGM 2008, 2.26.) Note that this term does not refer to a degree of confidence or belief but to a dispersion of values whose attribution to a quantity reasonably satisfies a specified degree of confidence or belief.
objects using the same measuring system19. Imprecision is therefore caused by uncontrolled
variations to the equipment, operation or environment when measurements are repeated.
This sort of variation is a ubiquitous but not exclusive source of measurement uncertainty.
As will be explained below, some measurement uncertainty stems from other sources,
including imperfect corrections to systematic errors. The notion of epistemic accuracy is
therefore broader than that of precision.
Third, operational measurement accuracy is determined relative to an established
measurement standard. For example, a thermometer is operationally accurate if its outcomes are
close to those of a standard thermometer when the two measure relevantly similar samples.
The most common way of evaluating operational accuracy is by calibration, i.e. by modeling
an instrument in a manner that establishes a relation between its indications and standard
quantity values20.
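The calibration relation just described can be given a minimal computational sketch: fit a relation between an instrument's indications and standard quantity values, then use that relation to correct raw readings. The data points and the assumed linear form below are purely illustrative, not drawn from the text:

```python
# A minimal sketch of calibration: establish a relation between an instrument's
# indications and standard quantity values, then use it to correct raw readings.
# The data points and the assumed linear form are purely illustrative.
indications = [10.2, 20.1, 30.4, 40.3]  # raw instrument readings
standards = [10.0, 20.0, 30.0, 40.0]    # corresponding standard values

n = len(indications)
mx = sum(indications) / n
my = sum(standards) / n
# Ordinary least-squares slope and intercept.
slope = (sum((x - mx) * (y - my) for x, y in zip(indications, standards))
         / sum((x - mx) ** 2 for x in indications))
intercept = my - slope * mx

def corrected(reading):
    """Map a raw indication to a calibrated quantity value."""
    return slope * reading + intercept

print(round(corrected(25.0), 2))
```

Real calibrations are, of course, far richer: they model the instrument's physics and propagate the uncertainty of the standard, not just a fitted line.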
Fourth, comparative accuracy is the closeness of agreement among measurement
outcomes when the same quantity is measured in different ways. The notion of comparative
accuracy is closely linked with that of reproducibility. To say that a measurement outcome is
comparatively accurate is to say that it is closely reproducible under controlled variations to
measurement conditions and methods.21 For example, thermometers in a given set are
comparatively accurate if their outcomes closely agree with one another when applied to
relevantly similar samples.
19 cf. “Measurement Precision” (JCGM 2008, 2.15) and “Measurement Repeatability” (ibid, 2.21). My concept of precision is narrower than that of the VIM (see also fn. 21.)
20 cf. “Calibration” (JCGM 2008, 2.39)
21 Unlike precision, reproducibility concerns controlled variations to measurement conditions. I deviate slightly from the VIM on this point to reflect general scientific usage of these terms (cf. “Measurement Reproducibility”, JCGM 2008, 2.25).
Finally, pragmatic measurement accuracy is accuracy sufficient for a specific use, such
as a solution to an engineering problem. There are four sub-senses of pragmatic accuracy,
corresponding to the first four senses of measurement accuracy. For example, a
thermometer is pragmatically accurate in an epistemic sense if the overall uncertainty of its
outcomes is low enough to reliably achieve a specified goal, e.g. keeping an engine from
over-heating. Of course, whether or not a measuring system (or a measured value) is
pragmatically accurate depends on its intended use22.
In the physical sciences quantitative expressions of measurement accuracy are typically
cast in epistemic terms, namely in terms of uncertainty. This does not mean that scientific
estimates of accuracy are always and only estimates of epistemic accuracy. What matters to
the classification of accuracy is not the form of its expression, but the kind of evidence on
which estimates of accuracy are based. As I will argue below, metrological evaluations
provide evidence of the right sort for estimating accuracy under all five notions. Before
delving into the argument, the next section will provide some background on the concepts,
methods and problems involved in the standardization of time.
1.3. The multiple realizability of unit definitions
A key distinction in the standardization of physical units is that between definition
and realization. Since 1967 the second has been defined as the duration of exactly
22 Pragmatic accuracy may be understood as a threshold (pass/fail) concept. Alternatively, pragmatic accuracy may be represented continuously, for example as the likelihood of achieving the specified goal. Both analyses of the concept are compatible with the argument presented here.
9,192,631,770 periods of the radiation corresponding to a hyperfine transition of cesium-133
in the ground state (BIPM 2006). This definition pertains to an unperturbed cesium atom at
a temperature of absolute zero. Because the definition is an idealized description of a kind of
atomic system, no actual cesium atom ever satisfies it. Hence a question arises as to how the
reference of ‘second’ is fixed. The traditional philosophical approach would be to propose
some ‘semantic machinery’ through which the definition succeeds in picking out a definite
duration, e.g. a possible-world semantics of counterfactuals. However, this sort of approach
is hard-pressed to explain how metrologists are able to experimentally access the extension
of ‘second’ given that it is physically impossible to instantiate the conditions
specified by the definition. Consequently, it becomes unclear how metrologists are able to
tell whether the actual durations they label ‘second’ satisfy the definition. By contrast, the
approach adopted in this chapter takes the definition to fix a reference only indirectly and
approximately by virtue of its role in guiding the construction of atomic clocks. Rather than
picking out any definite duration on its own, the definition functions as an ideal specification
for a class of atomic clocks. These clocks approximately satisfy – or in the metrological jargon,
‘realize’ – the conditions specified by the definition23. The activities of constructing and
modeling cesium clocks are therefore taken to fulfill a semantic function, i.e. that of
approximately fixing the reference of ‘second’, rather than simply measuring an already
linguistically fixed time interval.
The construction of an accurate primary realization of the second – a ‘meter stick’ of
time – must make highly sophisticated use of theory, apparatus and data analysis in order to
23 The verb ‘realize’ has various meanings in philosophical discussions. Here I follow the metrological use of this term and take it to be synonymous with ‘approximately satisfy’ (pertaining to a definition.)
approximate as much as possible the ideal conditions specified by the definition. But
multiple kinds of physical processes can be constructed that would realize the second, each
departing from the ideal definition in different respects and degrees. In other words,
different clock designs and environments correspond to different ways of de-idealizing the
definition. As of 2009, thirteen atomic clocks around the globe are used as primary
realizations of the second. There are also hundreds of official secondary realizations of the
second, i.e. atomic clocks that are traced to primary realizations. Like any collection of
physical instruments, different realizations of the second disagree with one another, i.e. ‘tick’
at slightly different rates. The definition of the second is thus multiply realizable in the sense
that multiple real durations approximately satisfy the definition, and no method can
completely rid us of the approximations.
That the definition of the second is multiply realizable does not mean that there are
as many ‘seconds’ as there are clocks. What it does mean is that metrologists are faced with
the task of continually evaluating the accuracy of each realization relative to the ideal cesium
transition frequency and correcting its results accordingly. But the ideal frequency is
experimentally inaccessible, and primary standards have no higher standard against which
they can be compared. The challenge, then, is to forge a unified second out of disparately
‘ticking’ clocks. This is an instance of a general problem that I will call the problem of multiple
realizability of unit definitions24. This problem is semantic, epistemological and methodological
all at once. To solve it is to specify experimentally testable satisfaction criteria for the
24 Chang’s (2004, 59) ‘problem of nomic measurement’ is a closely related, though distinct, problem concerning the standardization of instruments. Both problems instantiate the entanglement between claims to coordination, accuracy and quantity individuation mentioned in the introduction to this thesis. This entanglement and its consequences will be discussed in detail in Chapter 2.
idealized definition of ‘second’, a task which is equivalent to that of specifying grounds for
making accuracy claims about cesium atomic clocks, which is in turn equivalent to the task
of specifying a method for reconciling discrepancies among such clocks. The conceptual
distinction among three axes of the problem should not obscure its pragmatic unity, for as
we shall see below, metrologists are able to resolve all three aspects of the problem
simultaneously.
Prima facie, the problem can be solved by arbitrarily choosing one realization as the
ultimate standard. Yet this solution would bind all measurement to the idiosyncrasies of a
specific artifact, thereby causing measurement outcomes to diverge unnecessarily. Imagine
that all clocks were calibrated against the ‘ticks’ of a single apparatus: the instabilities of that
apparatus would cause clocks to run faster or slower relative to each other depending on the
time of their calibration, and the discrepancy would be revealed when these clocks were
compared to each other. A similar scenario has recently unfolded with respect to the
International Prototype of the kilogram, whose mass was discovered to systematically ‘drift’
relative to the masses of its official copies (Girard 1994). Hence a stipulative approach to unit
definitions exacerbates rather than removes the challenge of reconciling discrepancies
among multiple standards.
The latter point is helpful in elucidating the misunderstanding behind the myth of
absolute accuracy. Once it is acknowledged that unit definitions are multiply realizable, it
becomes clear that no single physical object can be used in practice to completely fix the
reference of a unit term. Rather, this reference must be determined by an ongoing
comparison among multiple realizations. Because these comparisons involve some
uncertainty, the references of unit terms remain vague to some extent. Nevertheless, as the
next sections will make clear, comparisons among standards allow metrologists to minimize
this vagueness, thereby providing an optimal solution to the problem of multiple
realizability.
1.4. Uncertainty and de-idealization
The clock design currently implemented in most primary realizations of the second is
known as the cesium ‘fountain’, so called because cesium atoms are tossed up in a vacuum
cylinder and fall back due to gravity. The best cesium fountains are said to measure the
relevant cesium transition frequency with a fractional uncertainty25 of less than 5 parts in 10¹⁶
(Jefferts et al. 2007). It is worthwhile to examine how this number is determined. To start
off, it is tempting to interpret this number naively as the standard deviation of clock
outcomes from the ideally defined duration of the second. However, because the definition
pertains to atoms in physically unattainable conditions, the aforementioned uncertainty
could not have been evaluated by direct reference to the ideal second. Nor is this number
the standard deviation of a sample of readings taken from multiple cesium fountain clocks.
If a purely statistical approach of this sort were taken, metrologists would have little insight
into the causes of the distribution and would be unable to tell which clocks ‘tick’ closer to
the defined frequency.
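For a sense of scale, the cited fractional uncertainty can be converted into an absolute frequency uncertainty. This back-of-the-envelope sketch uses the defined cesium frequency and the bound cited above; nothing else is assumed:

```python
# The cesium hyperfine frequency that defines the second (exact by definition).
f_cs = 9_192_631_770  # Hz

# The cited bound on fractional uncertainty for the best cesium fountains.
fractional_u = 5e-16

# Fractional uncertainty is the ratio of uncertainty to the best estimate of
# the value being measured, so the absolute uncertainty is their product.
absolute_u = fractional_u * f_cs
print(f"{absolute_u:.2e} Hz")  # on the order of a few microhertz
```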
The accepted metrological solution to these difficulties is to de-idealize the
theoretical definition of the second in discrete ‘steps’, and estimate the uncertainty that each
‘step’ contributes to the outcomes of a given clock. The uncertainty associated with a
25 ‘Fractional’ or ‘relative’ uncertainty is the ratio between measurement uncertainty and the best estimate of the value being measured (usually the mean.)
specific primary frequency standard is then taken to be the total uncertainty contributed to
its outcomes by a sufficient de-idealization of the definition of the second as it applies to that particular
clock. The rest of the present section will describe how this uncertainty is evaluated, and the
next section will describe what kind of evidence is taken to establish the ‘sufficiency’ of de-
idealization.
Two kinds of de-idealization of the definition are involved in evaluating the
uncertainty of frequency standards. These correspond to two different methods of
evaluating measurement uncertainty that metrologists label ‘type-A’ and ‘type-B’26. First, the
definition of the second is idealized in the sense that it presupposes that the relevant
frequency of cesium is a single-valued number. By contrast, the frequency of any real
oscillation converges to a single value only if averaged over an infinite duration, due to so-
called ‘random’ fluctuations. De-idealizing the definition in this respect means specifying a
set of finite run times, and evaluating the width of the distribution of frequencies for each
run time. Uncertainties evaluated in this manner, i.e. by the statistical analysis of a series of
observations, fall under ‘type-A’.
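The type-A procedure just described can be mimicked in a simplified simulation. Clock metrologists in fact use Allan variance statistics for this purpose; the sketch below substitutes a plain standard deviation over simulated white frequency noise, merely to illustrate how the width of the distribution of averaged frequencies shrinks as the run time grows:

```python
import math
import random

# Simulated white frequency noise standing in for raw fractional frequency
# readings; real type-A evaluations use measured data and Allan statistics.
random.seed(0)
readings = [random.gauss(0.0, 1.0) for _ in range(10_000)]

for n in (10, 100, 1000):  # samples per run, i.e. increasing run time
    runs = [sum(readings[i:i + n]) / n for i in range(0, len(readings), n)]
    mean = sum(runs) / len(runs)
    width = math.sqrt(sum((r - mean) ** 2 for r in runs) / (len(runs) - 1))
    print(n, round(width, 3))  # width falls roughly as 1/sqrt(n) for white noise
```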
The second kind of de-idealization of the definition has to do with systematic effects.
For example, one way in which the definition of the second is idealized is that it presupposes
that the cesium atom resides in a completely flat spacetime, i.e. a gravitational potential of
zero. General relativity predicts that, when measured in real conditions on earth, the cesium
frequency will be red-shifted by an amount depending on the altitude of the laboratory housing the
clock. The magnitude of this ‘bias’ is calculated based on a theoretical model of the earth’s
26 See JCGM (2008a) for a comprehensive discussion. The distinction between type-A and type-B uncertainty is unrelated to that of type I vs. type II error. Nor should it be confused with the distinction between random and systematic error.
gravitational field and an altitude measurement27. The measurement of altitude itself involves
some uncertainty, which propagates to the estimate of the shift and therefore to the
corrected outcomes of the clock. This sort of uncertainty, i.e. uncertainty associated with
corrections to systematic errors, falls under ‘type-B’.
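As a rough illustration of the correction involved, the weak-field approximation gives a fractional gravitational frequency shift of g·h/c² for a clock at height h above the geoid. The altitude below is hypothetical, and real corrections rest on a detailed geoid model (see footnote 27), so this is only a sketch of the order of magnitude:

```python
# Illustrative gravitational red-shift correction for a clock at altitude h,
# using the weak-field approximation Δf/f ≈ g·h/c².
g = 9.80665        # standard gravity, m/s^2
c = 299_792_458.0  # speed of light, m/s
h = 1_000.0        # hypothetical clock altitude above the geoid, m

fractional_shift = g * h / c**2
print(f"fractional frequency shift: {fractional_shift:.3e}")  # ~1.09e-13 for 1 km
```

Since this shift is comparable to, or larger than, the uncertainties of the best fountains, the correction and its own uncertainty are unavoidable parts of the budget.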
In addition to gravitational effects, numerous other effects must be estimated and
corrected for a cesium fountain. With every such de-idealization and correction, some type-
B uncertainty is added to the final outcome, i.e. to the number of ‘ticks’ the clock is said to
have generated in a given period. The overall type-B uncertainty associated with the clock is
then taken to be equal to the root sum of squares of these individual uncertainties28. In other
words, the type-B uncertainty of a primary standard is determined by the accumulated
uncertainty associated with corrections applied to its readings. The general method of evaluating
the overall accuracy of measuring systems in this way is known as uncertainty budgeting.
Metrologists draw up tables with the contribution of each correction and a ‘bottom line’ that
expresses the total type-B uncertainty (an example will be given in Section 1.6). Such tables
make explicit the fact that ‘raw ticks’ generated by a clock are by themselves insufficient to
determine the uncertainty associated with that clock. Uncertainties crucially depend not only
on the apparatus, but also on how the apparatus is modeled, and on the level of detail with
which such models capture the idiosyncrasies of a particular apparatus.
27 The calculation of this shift involves the postulation of an imaginary rotating sphere of equal local gravitational potential called a geoid, which roughly corresponds to the earth’s sea level. Normalization to the geoid is intended to transform the proper time of each clock to the coordinate time on the geoid. See for example Jefferts et al (2002, 328).
28 This method of adding uncertainties is only allowed when it is safe to assume that uncertainties are uncorrelated.
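The uncertainty-budgeting arithmetic described above can be sketched as follows. The budget entries and their magnitudes are invented for illustration and are not taken from any published table:

```python
import math

# Hypothetical uncertainty budget for a frequency standard: each entry is the
# type-B uncertainty contributed by one correction, in parts in 10^16.
budget = {
    "gravitational red shift": 0.3,
    "second-order Zeeman shift": 1.0,
    "blackbody radiation shift": 2.6,
    "spin-exchange shift": 1.2,
}

# The 'bottom line': root sum of squares of the individual contributions.
# As footnote 28 notes, this is valid only for uncorrelated uncertainties.
total_type_b = math.sqrt(sum(u ** 2 for u in budget.values()))
print(f"combined type-B uncertainty: {total_type_b:.2f} parts in 10^16")
```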
1.5. A robustness condition for accuracy
We saw that metrologists successively de-idealize the definition of the second until it
describes the specific apparatus at hand. The type-A and type-B uncertainties accumulated in
this process are combined to produce an overall uncertainty estimate for a given clock. This
is how, for example, metrologists arrived at the estimate of fractional frequency uncertainty
cited in the previous section.
A question nevertheless remains as to how metrologists determine the point at which
de-idealization is ‘sufficient’. After all, a complete de-idealization of any physical system is itself
an unattainable ideal. Indeed, the most difficult challenges that metrologists face involve
building confidence in descriptions of their apparatus. Such confidence is achieved by
pursuing two interlocking lines of inquiry: on the one hand, metrologists work to increase
the level of detail with which they model clocks. On the other hand, clocks are continually
compared to each other in light of their most recent theoretical and statistical models. The
uncertainty budget associated with a standard is then considered sufficiently detailed if and
only if these two lines of inquiry yield consistent results. The upshot of this method is that
the uncertainty ascribed to a standard clock is deemed adequate if and only if the outcomes of
that clock converge to those of other clocks within the uncertainties ascribed to each clock by appropriate
models, where appropriateness is determined by the best currently available theoretical
knowledge and data-analysis methods. This kind of convergence is routinely tested for all
active cesium fountains (Parker et al 2001, Li et al 2004, Gerginov 2010) as well as for
candidate future standards, as will be shown below.
The requirement for convergence under appropriate models embeds a double
robustness condition, which may be generalized in the following way:
(RC) Given multiple, sufficiently diverse realizations of the same unit, the
uncertainties ascribed to these realizations are adequate if and only if
(i) discrepancies among realizations fall within their ascribed
uncertainties; and
(ii) the ascribed uncertainties are derived from appropriate models of
each realization.
These two conditions loosely correspond to what Woodward (2006) calls
‘measurement robustness’ and ‘derivational robustness’. The first kind of robustness
concerns the stability of a measured value under varying measurement procedures, while the
second concerns the stability of a prediction under varying modeling assumptions. Note,
however, that in the present case we are not dealing with two independently satisfiable
conditions, but with two sides of a single, composite robustness condition. Recall that the
discrepancies mentioned in sub-condition (i) already incorporate corrections to the quantity
being compared, corrections that were calculated in light of detailed models of the relevant
apparatuses. Conversely, the ‘appropriateness’ of models in (ii) is considered sufficiently
established only once it is shown that these models correctly predict the range of
discrepancies among realizations.
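Sub-condition (i) can be given a schematic computational form: pairwise discrepancies between realizations are checked against their combined ascribed uncertainties. The offsets, uncertainties, and the coverage factor k below are all illustrative assumptions, not part of the condition as stated:

```python
import itertools
import math

# Hypothetical fractional frequency offsets and ascribed uncertainties
# (both in parts in 10^16) for three realizations of the second.
realizations = {
    "clock A": (0.0, 4.0),
    "clock B": (3.1, 3.5),
    "clock C": (-2.0, 5.0),
}

def mutually_consistent(clocks, k=2.0):
    """Check every pair of realizations; k is a coverage factor (an assumption)."""
    for (_, (xa, ua)), (_, (xb, ub)) in itertools.combinations(clocks.items(), 2):
        if abs(xa - xb) > k * math.hypot(ua, ub):
            return False
    return True

print(mutually_consistent(realizations))  # prints True for these hypothetical values
```

Sub-condition (ii) has no comparably mechanical test: the appropriateness of the models from which the uncertainties derive is a matter of theoretical judgment, which is precisely the point of the composite condition.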
Metrology teaches us that (RC) is indeed satisfied in many cases, sometimes with
stunningly small uncertainties. However, the question remains as to why one should take
uncertainties that satisfy this condition to be measures of the accuracy of standards. This
question can be answered by considering each of the five variants of accuracy outlined
above.
To start with the most straightforward case, the comparative accuracy of realizations
is simply the closeness of agreement among them, e.g. the relative closeness of the
frequencies of different cesium fountains. Clearly, uncertainties that fulfill sub-condition (i)
are (inverse) estimates of accuracy in this sense.
Second, from an operational point of view, the accuracy of a standard is the
closeness of its agreement to other standards of the same quantity. This is again explicitly
guaranteed by the fulfillment of (RC) under sub-condition (i). That sub-condition (i)
guarantees two types of accuracy is hardly surprising, since in the special case of
comparisons among standards the notions of comparative and operational accuracy are
coextensive.
Third, the epistemic conception of accuracy identifies the accuracy of a standard
with the narrowness of spread of values reasonably attributed to the quantity realized by that
standard. The evaluation of type-A and type-B uncertainties in light of current theories,
models and data-analysis tools is plausibly the most rigorous way of estimating the range of
durations that reasonably satisfy the definition of ‘second’. The appropriateness requirement
in sub-condition (ii) guarantees that uncertainties are evaluated in this way whenever
possible.
Fourth, according to the metaphysical conception of accuracy, the accuracy of a
standard is the degree of closeness between the estimated and true values of the realized
quantity. Here one may adopt a skeptical position and claim that the true values of physical
quantities are generally unknowable. The skeptic is in principle correct: it may be the case
that despite their diversity, all the measurement standards that metrologists have compared
are plagued by a common systematic effect that equally influences the realized quantity and
thus remains undetected. But for a non-skeptical realist who believes (for whichever reason)
that current theories are true or approximately true, condition (RC) provides a powerful test
for metaphysical accuracy because it relies on the successive de-idealization of the theoretical
definition of the relevant unit. Estimating the metaphysical accuracy of a cesium clock, for
example, amounts to determining the conceptual ‘distance’ of that clock from the ideal
conditions specified by the definition of the second. As mentioned, the uncertainties that go
into (RC) are consequences of precisely those respects in which the realization of a unit falls
short of the definition. It is therefore plausible to consider cross-checked uncertainty
budgets of multiple primary standards as supplying good estimates of metaphysical accuracy.
Nevertheless, it is important to note that condition (RC) and the method of uncertainty
budgeting do not presuppose anything about the truth of our current theories or the reality of
quantities. That is, (RC) is compatible with a non-skeptical realist notion of accuracy without
requiring commitment to its underlying metaphysics.
Finally, from a pragmatic point of view the accuracy of a standard is its capacity to
meet the accuracy needs of a certain application. Here the notion of ‘accuracy needs’ is
cashed out in terms of one of the first four notions of accuracy. As the uncertainties
vindicated by (RC) have already been shown to be adequate estimates of accuracy under the
first four notions, they are ipso facto adequate for the estimation of pragmatic accuracy.
1.6. Future definitions of the second
The methodological requirement to maximize robustness to the limit of what is
practically possible is one of the main reasons why unit definitions are not chosen arbitrarily.
If the definitions of units were determined arbitrarily, their replacement would be arbitrary
as well. But, as metrologists know only too well, changes to unit definitions involve a
complex web of theoretical, technological and economic considerations. Before the
metrological community accepts a new definition, it must be convinced that the relevant unit
can be realized more accurately with the new definition than with the old one. Here again ‘accuracy’ is
cashed out in terms of robustness. In the case of the second, for example, a new generation of
‘optical’ atomic clocks is already claimed to have achieved “an accuracy that exceeds current
realizations of the SI unit of time” (Rosenband et al. 2008, 1809). To demonstrate accuracy
that surpasses the current cesium standard, optical clocks are compared to each other in light
of their most detailed models available. Table 1.1 presents a comparative uncertainty budget
for aluminum and mercury optical clocks recently evaluated at NIST. The theoretical
description of each atomic system is de-idealized successively, and the uncertainties
contributed by each component add up to the ‘bottom line’ type-B uncertainty for each
clock. These uncertainties are roughly an order of magnitude lower than those ascribed to
cesium fountain clocks.
Table 1.1: Comparison of uncertainty budgets of aluminum (Al) and mercury (Hg) optical atomic clocks. This table was used to support the claim that both clocks are more accurate than the current cesium standard. ∆ν stands for fractional frequency bias and σ stands for uncertainty, both expressed in units of 10⁻¹⁸. (source: Rosenband et al 2008, 1809. Reprinted with permission from AAAS)
The experimenters showed that successive comparisons of the frequencies of these
clocks indeed yield outcomes that fall within the ascribed bounds of uncertainty, thereby
applying the robustness condition above. The fact that these clocks involve two different
kinds of atoms was taken to strengthen the robustness of the results. Nevertheless, it is
unlikely that the second will be redefined in the near future in terms of an optical transition.
More optical clocks must be built and compared before metrologists are convinced that such
clocks are modeled with sufficient detail. Meanwhile the accuracy of current cesium
standards is still being improved by employing new methods of controlling and correcting
for errors. In the long run, however, increasing technological challenges involved in
improving the accuracy of cesium fountains are expected to lead to the adoption of new
sorts of primary realizations of the second such as optical clocks.
1.7. Implications and conclusions
As the foregoing discussion has made clear, measurement standards are not
absolutely accurate, nor are they chosen arbitrarily. Moreover, unit definitions do not
completely fix the reference of unit terms, unless ‘fixing’ is understood in a manner that is
utterly divorced from practice. Instead, choices of unit definition, as well as choices of
realization for a given unit definition, are informed by intricate considerations from theory,
technology and data analysis.
The study of these considerations reveals the ongoing nature of standardization
projects. Theoretically, quantities such as mass, length and time are represented by real
numbers on continuous scales. The mathematical treatment of these quantities is indifferent
to the accuracy with which they are measured. But in practice, we saw that the procedures
required to measure duration in seconds change with the degree of accuracy demanded.
Consequently, a necessary condition for the possibility of increasing measurement accuracy
is that unit-concepts are continually re-coordinated with new measuring procedures29. Metrologists are
responsible for performing such acts of re-coordination in the most seamless manner
possible, so that for all practical purposes the second, meter and kilogram appear to remain
unchanged. This is achieved by constructing and improving primary and secondary
realizations, and (less frequently) by redefinition. The dynamic coordination of quantity
concepts with increasingly robust networks of instruments allows measurement results to
retain their validity even when standards are improved or replaced. Moreover, increasing
29 See van Fraassen’s discussion of the problem of coordination (2008, ch.5). I take my own robustness condition (RC) to be a methodological explication of van Fraassen’s ‘coherence constraint’ on acceptable solutions to this problem.
robustness minimizes vagueness surrounding the reference of unit terms, thereby providing
an optimal solution to the problem of multiple realizability of unit definitions.
2. Systematic Error and the Problem of Quantity Individuation
Abstract: When discrepancies are discovered between outcomes of different measuring instruments two sorts of explanation are open to scientists. Either (i) some of the outcomes are inaccurate or (ii) the instruments measure different quantities. Here I argue that, due to the possibility of systematic error, the choice between (i) and (ii) is in principle underdetermined by the evidence. This poses a problem for several contemporary philosophical accounts of measurement, which attempt to analyze ‘foundational’ concepts like quantity independently of ‘applied’ concepts like error. I propose an alternative, model-based account of measurement that challenges the distinction between foundations and application, and show that this account dissolves the problem of quantity individuation.
2.1. Introduction
Physical quantities – the speed of light, the melting point of gold, the earth’s diameter
– can often be measured in more than one way. Instruments that measure a given quantity
may differ markedly in the physical principles they utilize, and it is difficult to imagine
scientific inquiry proceeding were this not the case. The possibility of measuring the same
quantity in different ways is crucial to the detection of experimental errors and the
development of general scientific theories. An important question for any epistemology of
measurement is therefore: ‘how are scientists able to know whether or not different
instruments measure the same quantity?’
However straightforward this question may seem, an adequate account of quantity
individuation across measurement procedures has so far eluded philosophical accounts of
measurement. Contemporary measurement theories either completely neglect this question
or provide overly simplistic answers. As this chapter will show, the question of quantity
individuation is of central concern to theories of measurement. Not only is the question
more difficult than previously thought, but when properly appreciated the challenge posed
by this question undermines a widespread presupposition in contemporary philosophy of
measurement. This presupposition will be referred to here as conceptual foundationalism.
Prevalent in the titles of key works such as Ellis’ Basic Concepts of Measurement (1966) and
Krantz et al’s Foundations of Measurement (1971), conceptual foundationalism is the thesis that
measurement concepts are rigidly divided into ‘fundamental’ and ‘applied’ types, the former
but not the latter being the legitimate domain of philosophical analysis. Fundamental
measurement concepts – particularly, the notions of quantity and scale – are supposed to
have universal criteria of existence and identity. Such criteria apply to any measurement
regardless of its specific features. For example, whether or not two procedures measure the
same quantity is determined by applying a universal criterion of quantity identity to their
results, regardless of which quantity they happen to measure or how accurately they happen
to measure it. By contrast, ‘applied’ concepts like accuracy and error are seen as experimental
in nature. Discussion of the ‘applied’ portion of measurement theory is accordingly left to
laboratory manuals or other forms of discipline-specific technical literature.
As I will argue in this chapter, conceptual foundationalist approaches do not, and
cannot, provide an adequate analysis of the notion of measurable quantity. This is because
the epistemic individuation of measurable quantities essentially depends on considerations of
error distribution across measurement procedures. Questions of the form ‘what quantity
does procedure P measure?’ cannot be answered independently of questions about the
accuracy of P. Deep conceptual and epistemic links tie together the so-called ‘fundamental’
and ‘applied’ parts of measurement theory and prevent identity criteria from being specified
for measurable quantities independently of the specific circumstances of their measurement.
The main reason that these links have been ignored thus far is a misunderstanding of
the notion of measurement error, and particularly systematic measurement error. The
possibility of systematic error – if it is acknowledged at all in philosophical discussions of
measurement – is usually brought up merely to clarify its irrelevance to the discussion.30 The
next section of this chapter will therefore be dedicated to an explication of the idea of
systematic error and its relation to theoretical and statistical assumptions about the specific
measurement process. These insights will be used to generate a challenge for the conceptual
foundationalist that I will call ‘the problem of quantity individuation.’ The following section
will discuss the ramifications of this problem for several conceptual foundationalist theories
of measurement, including the Representational Theory of Measurement (Krantz et al 1971.)
Finally, Section 2.4 will present an alternative, non-foundationalist account of quantity
individuation. I will argue that claims to quantity individuation are adequately tested by
establishing coherence and consistency among models of different measuring instruments.
The account will serve to elucidate the model-based approach to measurement and to
demonstrate its ability to avoid conceptual problems associated with foundationalism.
Moreover, the model-based approach will provide a novel understanding of the epistemic functions of systematic errors. Instead of being conceived merely as obstacles to the reliability of experiments, systematic errors will be shown to constitute indispensable tools for unifying quantity concepts in the face of seemingly inconsistent evidence.

30 See for example Campbell (1920, pp. 471-3).
2.2. The problem of quantity individuation
2.2.1. Agreement and error
How can one tell whether two different instruments measure the same quantity? This
question poses a fundamental challenge to the epistemology of measurement. For any
attempt to test whether two instruments measure the same quantity, either by direct
comparison or by reference to other instruments, involves testing for agreement among
measurement outcomes; but any test of agreement among measurement outcomes must
already presuppose that those outcomes pertain to the same quantity.
To clarify the difficulty, let us first consider the sort of evidence required to establish
agreement among the outcomes of different measurements. We can imagine two
instruments that are intended to measure the same quantity, such as two thermometers. For
the sake of simplicity, we may assume that the instruments operate on a common set of
measured samples. Now suppose that we are asked to devise a test that would determine
whether the two instruments agree in their outcomes when they are applied to samples in
the set.
Naively, one may propose that the instruments agree if and only if their indications exactly coincide when presented with the same samples under the same conditions. But
variations in operational or environmental conditions cause indications to diverge between
successive measurements, and this should not count as evidence against the claim that the
outcomes of the two instruments are compatible.
A more sophisticated proposal would be to repeat the comparison between the
instruments several times under controlled conditions and to use one of any number of
statistical tests to determine whether the difference in readings is consonant with the
hypothesis of agreement between the instruments. This procedure would determine whether
the readings of the two instruments coincide within type-A (or ‘statistical’) uncertainty, the
component of measurement uncertainty typically associated with random error.31
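Such a type-A comparison can be sketched as follows. This is a minimal illustration with hypothetical readings, not a substitute for a proper statistical test (e.g. Student’s t); it simply checks whether two means coincide within their combined standard uncertainties, using a coverage factor k:

```python
import statistics

def type_a_agreement(readings1, readings2, k=2.0):
    """Check whether two sets of repeated readings coincide within
    type-A (statistical) uncertainty, using coverage factor k."""
    m1, m2 = statistics.mean(readings1), statistics.mean(readings2)
    # Standard uncertainty of each mean: sample stdev / sqrt(n)
    u1 = statistics.stdev(readings1) / len(readings1) ** 0.5
    u2 = statistics.stdev(readings2) / len(readings2) ** 0.5
    combined = (u1 ** 2 + u2 ** 2) ** 0.5
    return abs(m1 - m2) <= k * combined

# Hypothetical repeated readings from two thermometers (°C):
t1 = [49.98, 50.02, 50.01, 49.99, 50.00]
t2 = [50.00, 50.02, 49.99, 50.01, 50.03]
print(type_a_agreement(t1, t2))  # -> True
```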
However, due to the possibility of systematic error, coincidence within type-A
uncertainty is neither a necessary nor sufficient criterion for agreement among measuring
instruments, regardless of which statistical test is used. Mathematically speaking, an error is
‘systematic’ if its expected value after many repeated measurements is nonzero.32 In most
cases, the existence of such errors cannot be inferred from the distribution of repeated
readings but must involve some external standard of accuracy.33 Once systematic errors are
corrected, seemingly disparate readings may turn out to stand for compatible outcomes
while apparently convergent readings can prove to mask disagreement. Consequently, it is
impossible to adjudicate questions concerning agreement among measuring instruments
before systematic errors have been corrected.
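The point that a nonzero expected error cannot be removed by averaging can be illustrated with a toy simulation (all numbers hypothetical): the random component shrinks as readings accumulate, but the mean error converges on the fixed offset rather than on zero.

```python
import random

random.seed(7)

def measure(true_value, bias=0.3, noise=0.1):
    """One reading: true value + fixed systematic offset + random noise."""
    return true_value + bias + random.gauss(0.0, noise)

readings = [measure(20.0) for _ in range(10000)]
mean_error = sum(r - 20.0 for r in readings) / len(readings)
# Averaging suppresses random error, but the expected error stays ~0.3:
print(round(mean_error, 1))  # -> 0.3
```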
31 My terminology follows the official vocabulary of the International Bureau of Weights and Measures as published by the Joint Committee for Guides in Metrology. For definitions and discussion of type-A and type-B uncertainties see JCGM (2008, 2008a). Note that the distinction between type-A and type-B uncertainty is unrelated to that of type I vs. type II error. Nor should it be confused with the distinction between random and systematic error.
32 cf. JCGM (2008, 2.17).
33 Some systematic errors can be evaluated purely statistically, such as random walk noise (a.k.a. Brownian noise) in the frequency of electric signals.
A well-known example34 concerns glass containers filled with different thermometric
fluids – e.g. mercury, alcohol and air. If one examines the volume indications of these
thermometers when applied to various samples, one discovers that temperature intervals that
are deemed equal by one instrument are deemed unequal by the others. These discrepancies
are stable over many trials and therefore not eliminable through statistical analysis of
repeated measurements. Moreover, because the ratio between corresponding volume
intervals measured by different thermometers is not constant, it is impossible to eliminate
the discrepancy by linear scale transformations such as from Celsius to Fahrenheit.
Nevertheless, from the point of view of scientific methodology these thermometers
may still turn out to be in good agreement once an appropriate nonlinear correction is
applied to their readings. Such numerical correction is often made transparent to users by
manipulating the output of the instrument, e.g. by incorporating the correction into the
gradations on the display. For example, if the thermometers appear to disagree on the
location of the midpoint between the temperatures of freezing and boiling water, the ‘50
Celsius’ marks on their displays may simply be moved so as to restore agreement. Corrective
procedures of this sort are commonplace during calibration and are viewed by scientists as
enhancing the accuracy of measuring instruments. Indeed, in discussions concerning
agreement among measurement outcomes scientists almost never compare ‘raw’, pre-
calibrated indications of instruments directly to each other, a comparison that is thought to
be uninformative and potentially misleading.
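The kind of nonlinear correction described above can be sketched in a few lines. The numbers are hypothetical; a real calibration would fit a curve to many comparison points and propagate the associated uncertainties:

```python
def make_correction(raw, standard):
    """Quadratic correction curve through three paired calibration
    points (raw indication -> value on the standard scale), built
    by Lagrange interpolation."""
    (x0, x1, x2), (y0, y1, y2) = raw, standard
    def corrected(x):
        return (y0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
                + y1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
                + y2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1)))
    return corrected

# Hypothetical: a thermometer agrees with the air standard at the
# freezing and boiling points but reads 48.5 at the true midpoint.
correct = make_correction((0.0, 48.5, 100.0), (0.0, 50.0, 100.0))
print(correct(48.5))  # -> 50.0  (the '50' mark is effectively moved)
```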
34 See Mach (1966 [1896]), Ellis (1966, 90-110), Chang (2004, Ch. 2) and van Fraassen (2008, 125-30).

What sort of evidence should one look for to decide whether and how much to correct the indications of measuring instruments? Background assumptions about what the instrument is measuring play an important role here. When in 1887 Michelson and Morley
measured the velocity of light beams propagating parallel and perpendicular to the supposed
ether wind they observed little or no significant discrepancy.35 Whether this result stands for
agreement between the two values of velocity nevertheless depends on how one represents the
apparatus and its interaction with its environment. Fitzgerald and Lorentz hypothesized that
the arms of the interferometer contracted in dependence on their orientation relative to the
ether wind, an effect that would result in a systematic error that exactly cancels out the
expected difference of light speeds. According to this representation of what the apparatus
was measuring, the seeming convergence in velocities merely masked disagreement. By
contrast, under Special Relativity length contraction is considered not an extraneous
disturbance to the measurement of the velocity of light but a fundamental consequence of
its invariance, and the results are taken to indicate genuine agreement. Hence an effect that
requires systematic correction under one representation of the apparatus is deemed part of
the correct operation of the apparatus under another.
35 Here the comparison is not between different instruments but different operations that involve the same instrument. I take my argument to apply equally to both cases.

A similar point is illustrated, though under very different theoretical circumstances, by the development of thermometry in the eighteenth and nineteenth centuries. As noted by Chang (2004, Ch. 2), by the mid-1700s it was well known that thermometers employing different sorts of fluids exhibit nonlinear discrepancies. This discovery prompted the rejection of the naive assumption that the volume indications of all thermometers were linearly correlated with temperature. Eventually, comparisons among thermometers (culminating in the work of Henri Regnault in the 1840s) gave rise to the adoption of air thermometers as standards. But the adoption of air as a standard thermometric fluid did not
cause other thermometers, such as mercury thermometers, to be viewed as categorically less
accurate. Instead, the adoption of the air standard led to the recalibration of mercury
thermometers under the assumption that their indications are nonlinearly correlated with
temperature. What matters to the accuracy of mercury thermometers under the new
assumption is no longer their linearity but the predictability of their deviation from air
thermometers. The indications of a mercury thermometer could now deviate from linearity
without any loss of accuracy as long as they were predictably correlated with corresponding
indications of a standard. Once again, what is taken to be an error under one representation
of the apparatus is deemed an accurate result under another.
2.2.2. The model-relativity of systematic error
The examples of thermometry and interferometry highlight an important feature of
systematic error: what counts as a systematic error depends on a set of assumptions
concerning what and how the instrument is measuring. These assumptions serve as a basis
for constructing a model of the measurement process, that is, an abstract quantitative
representation of the instrument’s behavior including its interactions with the sample and
environment. The main function of such models is to allow inferences to be made from
indications (or ‘readings’) of an instrument to values of the quantity being measured. While
various types of models are involved in interpreting measurement results, for the sake of the
current discussion it is sufficient to distinguish between models of the data generated by a
measurement process and theoretical models representing the dynamics of a measurement
process. Both sorts of models involve assumptions about the measuring instrument, the sample and environment, but the kinds of assumptions differ in the two cases.
Models of data (or ‘data models’) are constructed out of assumptions about the
relationship between possible values of the quantity being measured, possible indications of
the instrument, and values of extraneous variables, including time.36 These assumptions are
used to predict a functional relation between the input and output of an instrument known
as the ‘calibration curve’. We already saw the centrality of data models to the detection of
systematic error in the thermometry example. The initial assumption of linear expansion of
fluids provided a rudimentary calibration curve that allowed inferring temperature from
volume. The linear data model nevertheless proved to be of limited accuracy, beyond which
its predictions came into conflict with the assumption that different instruments measure the
same single-valued quantity, temperature.37 Hence a systematic error was detected based on
linear data models and later corrected by constructing more complex data models that
incorporate nonlinearities. In the course of this modeling activity the thermometers in
question are viewed largely as ‘black-boxes’, and very little is assumed about the mechanisms
that cause fluids and gases to expand when heated.38
In the Michelson-Morley example, by contrast, model selection was informed by a
theoretical account of how the apparatus worked. Generally, a theoretical model of a
measuring instrument represents the internal dynamics of the instrument as well as its
interactions with the environment (e.g. ether) and the measured sample (e.g. light beams.)
36 For a general account of models of data see Suppes (1962).
37 See Chang (2004, pp. 89-92).
38 The distinction between data models and theoretical models is closely related to the distinction between ‘black-box’ and ‘white-box’ calibration tests. For detailed discussion of this distinction see Chapter 4, Sections 4.3 and 4.4, as well as Boumans (2006).
The accepted theoretical model of an instrument is crucial in specifying what the instrument is
measuring. The model also determines which behaviors of the instrument count as evidence
for a systematic error. Both of these epistemic functions are clearly illustrated by the
Michelson-Morley example, where the classical model of what the instrument is measuring
(light speed relative to the ether) was replaced with a relativistic model of what the
instrument is measuring (universally constant light speed in any inertial frame.) As part of
this change in the accepted theoretical model of the instrument, the dynamical explanation
of length contraction was replaced with a kinematic explanation. Rather than correct the
effects of length contraction, the new theoretical model of the apparatus conceptually
‘absorbed’ these effects into the value of the quantity being measured.
Despite vast differences between the two examples, they both illustrate the sensitivity
of systematic error to a representational context. That is, in both the thermometry and
interferometry cases the attribution of systematic errors to the indications of the instrument
depends on what the instrument is taken to measure under a given representation.
Furthermore, in both cases the error is corrected (or conceptually ‘eliminated’) merely by
modifying the model of the apparatus and without any physical change to its operation.
The following ‘methodological definition’ of systematic error makes explicit the
model-relativity of the concept:
Systematic error: a discrepancy whose expected value is nonzero between the
anticipated or standard value of a quantity and an estimate of that value based
on a model of a measurement process.
This definition is ‘methodological’ in the sense that it pertains to the method by which
systematic errors are detected and estimated. This way of defining systematic error differs
from ‘metaphysical’ definitions of error, which characterize measurement error in relation to
a quantity’s true value. The methodological definition has the advantage of being
straightforwardly applicable to scientific practice, because in most cases of physical
measurement the exact true value of a quantity is unknowable and thus cannot be used to
estimate the magnitude of errors.
Apart from its applicability to scientific practice, the methodological definition of
systematic error has the advantage of capturing all three sorts of ways in which systematic
errors may be corrected, namely (i) by physically modifying the measurement process – for
example, shielding the apparatus from causes of error; (ii) by modifying the theoretical or
data model of the apparatus or (iii) by modifying the anticipated value of the quantity being
measured. In everyday practice (i) and (ii) are usually used in combination, whereas (iii) is
much rarer and may occur due to a revision to the ‘recommended value’ of a constant or due
to changes in accepted theory.39 I will discuss the first two sorts of correction in detail below.

39 The Michelson-Morley example illustrates a combination of (ii) and (iii), as both the theoretical model of the interferometer and the expected outcome of measurement are modified.

The methodological definition of systematic error is still too broad for the purpose of the current discussion, because it includes errors that can be eliminated simply by changing the scale of measurement, e.g. by modifying the zero point of the indicator or by converting from, say, Celsius to Fahrenheit. By contrast, a subset of systematic errors that I will call ‘genuine’ cannot be eliminated in this fashion:

A genuine systematic error: a systematic error that cannot be eliminated merely by a permissible transformation of measurement scale.
The possibility of genuine systematic error will prove crucial to the individuation of
quantities across different measuring instruments. Unless otherwise mentioned, the term
‘systematic error’ henceforth denotes genuine systematic errors.
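The distinction can be illustrated with a toy sketch (the midpoint reading is hypothetical): an affine rescaling such as Celsius-to-Fahrenheit is a permissible transformation for interval scales, but no affine map can absorb a nonlinear discrepancy.

```python
def affine_through(p, q):
    """The affine map a*x + b fixed by two calibration points (x, y) --
    the form of a permissible rescaling of an interval scale."""
    (x0, y0), (x1, y1) = p, q
    a = (y1 - y0) / (x1 - x0)
    return lambda x: a * x + (y0 - a * x0)

# Celsius -> Fahrenheit: the 'discrepancy' between the scales vanishes
# under an affine transformation, so it is not a genuine error.
f = affine_through((0.0, 32.0), (100.0, 212.0))
print(round(f(25.0), 6))  # -> 77.0

# By contrast, a thermometer that matches a standard at 0 and 100 but
# reads 48.5 at the standard's midpoint exhibits a genuine systematic
# error: the affine map through the endpoints cannot repair it.
g = affine_through((0.0, 0.0), (100.0, 100.0))
print(g(48.5) == 50.0)  # -> False
```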
2.2.3. Establishing agreement: a threefold condition
To recapitulate the trajectory of the discussion so far, the need to test whether
different instruments measure the same quantity has led us to look for a test for agreement
among measurement outcomes. We saw that agreement can only be established once
systematic errors have been corrected, and that what counts as a systematic error depends on
how instruments are modeled. Consequently, any test for agreement among measuring
instruments is itself model-relative. The model-relativity of agreement is a direct
consequence of the fact that a change in modeling assumptions may result in a different
attribution of systematic errors to the indications of instruments. For this reason, the results
of agreement tests between measuring instruments may be modified without any physical
change to the apparatus, merely by adopting different modeling assumptions with respect to
the behavior of instruments.
Agreement is therefore established by the convergence of outcomes under specified models
of measuring instruments. Specifically, detecting agreement requires that:
(R1) the instruments are modeled as measuring the same quantity, e.g. temperature or the velocity of light;40
(R2) the indications of each instrument are corrected for systematic errors in light
of their respective models; and
(R3) the corrected indications converge within the bounds of measurement
uncertainty associated with each instrument.
As I have shown in the previous chapter, these requirements are implemented in practice by measurement experts (or ‘metrologists’) to establish the compatibility of measurement outcomes.41
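Requirements (R2) and (R3) can be rendered schematically as follows. The values are hypothetical and the uncertainty arithmetic is reduced to combination in quadrature; real evaluations follow the far richer framework of the GUM:

```python
def corrected_outcome(indication, correction, u_type_a, u_type_b):
    """(R2): apply the model-based correction to an indication and
    combine type-A and type-B uncertainties in quadrature."""
    return indication + correction, (u_type_a ** 2 + u_type_b ** 2) ** 0.5

def agree(outcome1, outcome2, k=2.0):
    """(R3): corrected outcomes (value, combined uncertainty) converge
    within expanded uncertainty, using coverage factor k."""
    (v1, u1), (v2, u2) = outcome1, outcome2
    return abs(v1 - v2) <= k * (u1 ** 2 + u2 ** 2) ** 0.5

# Hypothetical: two instruments modeled as measuring the same quantity (R1),
# each corrected for its own systematic errors (R2), then compared (R3).
o1 = corrected_outcome(100.12, -0.10, 0.01, 0.02)
o2 = corrected_outcome(99.95, +0.05, 0.02, 0.02)
print(agree(o1, o2))  # -> True
```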
Before we examine the epistemological ramifications of these three requirements, a
clarification is in order with respect to requirement (R3). This is the requirement that
convergence be demonstrated within the bounds of measurement uncertainty. The term
‘measurement uncertainty’ is here taken to refer to the overall uncertainty of a given
measurement, which includes not only type-A uncertainty calculated from the distribution of
repeated readings but also type-B uncertainty, the uncertainty associated with estimates of
the magnitudes of systematic errors42. Theoretical models of the apparatus play an important
role in evaluating type-B uncertainties and consequently in deciding what counts as
appropriate bounds for agreement between instruments. For example, the theoretical model of cesium fountain clocks (the atomic clocks currently used to standardize the unit of time, defined as one second) predicts that the output frequency of the clock will be affected by collisions among cesium atoms. The higher the density of cesium atoms housed by the clock, the larger the systematic error with respect to the quantity being measured, in this case the ideal cesium frequency in the absence of such collisions. To estimate the magnitude of the error, scientists manipulate the density of atoms and then extrapolate their data to the limit of zero density. The estimated magnitude of the error is then used to correct the raw output frequency of the clock, and the uncertainty associated with the extrapolation is added to the overall measurement uncertainty associated with the clock.43 This latter uncertainty is classified as ‘type-B’ because it is derived from a secondary experiment on the apparatus rather than from a statistical analysis of repeated clock readings.

40 This requirement may be specified either in terms of a quantity type (e.g. velocity) or in terms of a quantity token (e.g. velocity of light). Both formulations amount to the same criterion, for in both cases measurement outcomes are expressed in terms of some quantity token, e.g. a velocity of some thing.

41 See Chapter 1, Section 1.5, as well as the VIM definition of “Compatibility of Measurement Results” (JCGM 2008, 2.47).

42 See Chapter 1, Section 1.4, as well as fn. 31 above.
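The zero-density extrapolation just described can be sketched as a simple least-squares fit whose intercept estimates the collision-free frequency. The data points below are hypothetical (the unperturbed cesium frequency is 9 192 631 770 Hz by definition of the second); the actual evaluation, as in Jefferts et al (2002), is far more involved:

```python
def extrapolate_to_zero_density(densities, frequencies):
    """Least-squares line through (density, frequency) pairs;
    the intercept estimates the frequency at zero density."""
    n = len(densities)
    mx = sum(densities) / n
    my = sum(frequencies) / n
    sxx = sum((x - mx) ** 2 for x in densities)
    sxy = sum((x - mx) * (y - my) for x, y in zip(densities, frequencies))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical clock frequencies at three atom densities:
dens = [1.0, 2.0, 3.0]                               # arbitrary units
freq = [9192631770.2, 9192631770.4, 9192631770.6]    # Hz
f0, shift_per_density = extrapolate_to_zero_density(dens, freq)
print(round(f0, 1))  # -> 9192631770.0
```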
The conditions under which requirement (R3) is fulfilled are therefore model-relative
in two ways. First, they depend on a theoretical or data model of the measurement process
to establish what counts as a systematic error and therefore what the corrected readings are.
Second, as just noted, these conditions depend on how type-B uncertainties are evaluated,
which again depends on theoretical and statistical assumptions about the apparatus. As the
first two requirements (R1) and (R2) are already explicitly tied to models, the upshot is that
each of the three requirements that together establish agreement among measuring
instruments is model-relative in some respect.
43 See Jefferts et al (2002) for a detailed discussion of this evaluation method.
2.2.4. Underdetermination
When the threefold condition above is used as a test for agreement among measuring
instruments, the corrected readings may turn out not to converge within the expected bounds
of uncertainty. In such a case disagreement (or incompatibility) is detected between measurement outcomes. There are accordingly three possible sorts of explanation for such
disagreement:
(H1) the instruments are not measuring the same quantity;44
(H2) systematic errors have been inappropriately evaluated; or
(H3) measurement uncertainty has been underestimated.
How does one determine which is the culprit (or culprits)? Prima facie, one should
attempt to test each of these three hypotheses independently. To test (H1) scientists may
attempt to calibrate the instruments in question against other instruments that are thought to
measure the desired quantity. But this sort of calibration is again a test of agreement. For
calibration to succeed, one must already presuppose under requirement (R1) that the calibrated
and calibrating instrument measure the same quantity. The success of calibration therefore
cannot be taken as evidence for this presupposition. Alternatively, if calibration fails
scientists are faced with the very same problem, now multiplied.
44 Cf. fn. 40. As before, it makes no conceptual difference whether this hypothesis is formulated in terms of a quantity type or a quantity token. The choice of formulation does, however, make a practical difference to the strategies scientists are likely to employ to resolve discrepancies. See Section 2.4.3 for discussion.
Scientists may attempt to test (H2) or (H3) independently of (H1). But this is again
impossible, because the attribution of systematic error involved in claim (H2) is model-
relative and can only be tested by making assumptions about what the instruments are
measuring. Similarly, the evaluation of measurement uncertainty involved in testing (H3)
includes type-B evaluations that are relative to a theoretical model of the measurement
process. Moreover, as (H3) applies only to readings that have already been corrected for
systematic error, it cannot be tested independently of (H2) and ipso facto of (H1).
We are therefore confronted with an underdetermination problem. In the face of
disagreement among measurement outcomes, no amount of empirical evidence can alone
determine whether the procedures in question are inaccurate [(H2 or H3) is true] or whether
they are measuring different quantities (H1 is true). Any attempt to settle the issue by
collecting more evidence merely multiplies the same conundrum. I call this the problem of
quantity individuation. Like other cases of Duhemian underdetermination, it is only a problem
if one believes that there is a disciplined way of deciding which hypothesis to accept (or
reject) based on empirical evidence alone. As we shall see immediately below, several
contemporary philosophical theories of measurement indeed subscribe to this mistaken
belief. That is, they assume that questions of the form ‘what does procedure P measure?’ can
be answered decisively based on nothing more than the results of empirical tests, and
independently of any prior assumptions as to the accuracy of P. This belief lies at the heart
of the foundationalist approach to the notion of measurable quantity, a notion that is viewed
as epistemologically prior to the ‘applied’ challenges involved in making concrete
measurements.
A direct upshot of the problem of quantity individuation is that the individuation of
measurable quantities and the distribution of systematic error are but two sides of the same
epistemic coin. Specifically, the possibility of attributing genuine systematic errors to
measurement outcomes (along with relevant type-B uncertainties) is a necessary precondition
for the possibility of establishing the unity of quantities across the various instruments that
measure them. Unless genuine systematic errors are admitted as a possibility when analyzing
experimental results, instruments exhibiting nonscalable discrepancies cannot be taken to
measure the same quantity. Concepts such as temperature and the velocity of light therefore
owe their unity to the possibility of such errors, as do the laws in which such quantities
feature. The notion of measurement error, in other words, has a constructive function in the
elaboration of quantity concepts, a function that has so far remained unnoticed by theories
of measurement.
2.2.5. Conceptual vs. practical consequences
The problem of quantity individuation may strike one as counter-intuitive. Do not
scientists already know that their thermometers measure temperature before they set out to
detect systematic errors? The answer is that scientists often do know, but that their
knowledge is relative to background theoretical assumptions concerning temperature and to
certain traditions of interpreting empirical evidence. Such traditions serve, among other
things, to constrain the range of trade-offs between simplicity and explanatory power that a
scientific community would deem acceptable. Theoretical assumptions and interpretive
traditions inform the choices scientists make among the three hypotheses above. In ‘trivial’
cases of quantity individuation, namely in cases where previous agreement tests have already
been performed among similar instruments under similar conditions with a similar or higher
degree of accuracy, an appeal to background theories and traditions is usually sufficient for
determining which of the three hypotheses will be accepted.
As we shall see below, the foundationalist fallacy is to think of such choices as justified
in an absolute sense, that is, outside of the context of any particular scientific theory or
interpretive tradition. Van Fraassen calls this sort of absolutism the ‘view from nowhere’
(2008, 122) and rightly points out that there can be no way of answering questions of the
form ‘what does procedure P measure?’ independently of some established tradition of
theorizing and experimenting. He distinguishes between two sorts of contexts in which such
questions may be answered: ‘from within’, i.e. given the historically available theories and
instrumental practices at the time, or ‘from above’, i.e. retrospectively in light of
contemporary theories.
Although van Fraassen does not discuss the problem of quantity individuation, his
terminology is useful for distinguishing between two different consequences of this problem.
The first, conceptual consequence has already been mentioned: there can be no theory-free test
of quantity individuation. This consequence is not a problem for practicing scientists but only
for conceptual foundationalist accounts of measurement. It stems from the attempt to
devise a test for quantity individuation that would view measurement ‘from nowhere’, prior to
any theoretical assumptions about what is being measured and regardless of any particular
tradition of interpreting empirical evidence.
The other, practical consequence of the problem of quantity individuation is a challenge
for scientists engaged in ‘nontrivial’ measurement endeavors, ones that involve new kinds of
instruments, novel operating conditions or higher accuracy levels than previously achieved
for a given quantity. Exemplary procedures of calibration and error correction may not yet
exist for such measurements. In the face of incompatible outcomes from novel instruments,
then, researchers may not have at their disposal established methods for restoring
agreement. Nor can they settle the issue based on empirical evidence from comparison tests
alone, for as the problem of quantity individuation teaches us, such evidence is insufficient
for deciding which one (or more) of the three hypotheses above to accept. The practical
challenge is to devise new methods of adjudicating agreement and error ‘from within’, i.e. by
extending existing theoretical presuppositions and interpretive traditions to a new domain.
As we shall see below, multiple strategies are open to scientists confronted with
disagreement among novel measurements.
Historically, the process of extension has almost always been conservative. Scientists
engaged in cutting-edge measurement projects usually start off by dogmatically supposing that
their instruments will measure a given quantity in a new regime of accuracy or operating
conditions. This conservative approach is extremely fruitful as it leads to the discovery of
new systematic errors and to novel attempts to explain such errors. But such dogmatic
supposition should not be confused with empirical knowledge, because novel measurements
may lead to the discovery of new laws and to the postulation of quantities that are different
from those initially supposed. Instead, this sort of dogmatic supposition can be regarded as a
manifestation of a regulative ideal, an ideal that strives to keep the number of quantity
concepts small and underlying theories simple.
Due to their marked differences, I will consider the two consequences of the problem
of quantity individuation as two distinct problems. The next section will discuss the
conceptual problem and its consequences for foundationalist theories of measurement. The
following section will explain how the conceptual problem is dissolved by adopting a model-
based approach to measurement. I will then return to the practical problem – the problem of
deciding which hypotheses to accept in real, context-rich cases of disagreement – at the end
of Section 2.4. Unless otherwise mentioned, the ‘problem of quantity individuation’
henceforth refers to the conceptual problem.
2.3. The shortcomings of foundationalism
The conceptual problem of quantity individuation should not come as a surprise to
philosophers of science. It is, after all, a special case of a well-known problem named after
Duhem45. Nevertheless, a look at contemporary works on the philosophy of measurement
reveals that the problem of quantity individuation has so far remained unrecognized. Worse
still, the consequences of this problem are in conflict with several existing accounts of
physical measurement. This section is dedicated to a discussion of the repercussions of the
conceptual problem of quantity individuation for three philosophical theories of
measurement. A by-product of explicating these repercussions is that the problem itself will
be further clarified.
All three philosophical accounts discussed here are empiricist, in the sense that they
attempt to reduce questions about the individuation of quantities to questions about
relations holding among observable results of empirical procedures. These accounts are also
foundationalist insofar as they take universal criteria pertaining to the configuration of
observable evidence to be sufficient for the individuation of quantities, regardless of
theoretical assumptions about what is being measured or local traditions of interpreting
evidence. Hence for a foundational empiricist the result of an individuation test must not
45 Duhem (1991 [1914], 187)
depend on any background assumption unless that assumption can be tested empirically. As
I will argue, foundational empiricist criteria individuate quantities far too finely, leading to a
fruitless multiplication of natural laws. Such accounts of measurement are unhelpful in
shedding light on the way quantities are individuated by successful scientific theories.
2.3.1. Bridgman’s operationalism
The first account of quantity individuation I will consider is operationalism as
expounded by Bridgman (1927). Bridgman proposes to define quantity concepts in physics
such as length and temperature by the operation of their measurement. This proposal leads
Bridgman to claim that currently accepted quantity concepts have ‘joints’ where different
operations overlap in their value range or object domain. He warns against dogmatic faith in
the unity of quantity concepts across these ‘joints’, urging instead that unity be checked
against experiments. Bridgman nevertheless concedes that it is pragmatically justified to
retain the same name for two quantities if “within our present experimental limits a
numerical difference between the results of the two sorts of operations has not been
detected” (ibid, 16.)
Bridgman can be said to advance two distinct criteria of quantity individuation, the
first substantive and semantic, the other nominal and empirical. The first criterion is a direct
consequence of the operationalist thesis: quantities are individuated by the operations that
define them. Hence a difference in measurement operation is a sufficient condition for a
difference in the quantity being measured. But even if we grant Bridgman the existence of a
clear criterion for individuating operations, the operationalist approach generates an absurd
multiplicity of quantities and laws. Unless ‘operation’ is defined in a question-begging
manner, there is no reason to think that operating a ruler and operating an interferometer
(both used for measuring length) are instances of a single sort of operation. Bridgman, of
course, welcomed the multiplicity of quantity concepts in the spirit of empiricist caution.
Nevertheless, it is doubtful whether the sort of caution Bridgman advised is being served by
his operational analysis of quantity. As long as quantities are defined by operations, no two
operations can measure the same quantity; as a result, it is impossible to distinguish between
results that are ascribable to the objects being measured and those that are ascribable to
some feature of the operation itself, the environment, or the human operator. An
operational analysis, in other words, denies the possibility of testing the objective validity of
measurement claims – their validity as claims about measured objects. This denial stands in
stark contrast to Bridgman’s own cautionary methodological attitude.
Bridgman’s second, empirical criterion of individuation is meant to save physical
theory from conceptual fragmentation. According to the second criterion, quantities are
nominally individuated by the presence of agreement among the results of operations that
measure them. The same ‘nominal quantity’, such as length, is said to be measured by several
different operations as long as no significant discrepancy is detected among the results of
these operations. But this criterion is naive, because different operations that are thought to
measure the same quantity rarely agree with each other before being deliberately corrected
for systematic errors. Such corrections are required, as we have seen, even after one averages
indications over many repeated operations and ‘normalizes’ their scale. The empirical
criterion of individuation therefore fails to avoid the unnecessary multiplicity of quantities.
Alternatively, if by ‘numerical difference’ above Bridgman refers to measurement results that
have already been corrected for systematic errors, such numerical difference can only be
evaluated under the presupposition that the two operations measure the same quantity. This
presupposition is nevertheless the very claim that Bridgman needs to establish. This last
reading of Bridgman’s individuation criterion is therefore circular46.
2.3.2. Ellis’ conventionalism
A second and seemingly more promising candidate for an empiricist criterion of
quantity individuation is provided by Ellis in his Basic Concepts of Measurement (1966). Instead
of defining quantity concepts in terms of particular operations, Ellis views quantity concepts
as ‘cluster concepts’ that may be “identified by any one of a large number of ordering
relationships” (ibid, 35). Different instruments and procedures may therefore measure the
same quantity. What is common to all and only those procedures that measure the same
quantity is that they all produce the same linear order among the objects being measured: “If
two sets of ordering relationships, logically independent of each other, always generate the
same order under the same conditions, then it seems clear that we should suppose that they
are ordering relationships for the same quantity” (ibid, 34).
Ellis’ individuation criterion appears at first to capture the examples examined so far.
The thermometers discussed above preserve the order of samples regardless of the
46 Note that my grounds for criticizing Bridgman differ significantly from the familiar line of criticism expressed by Hempel (1966, 88-100). Hempel rejects the proliferation of operational quantity-concepts insofar as it makes the systematization of scientific knowledge impossible. In this respect I am in full agreement with Hempel. But Hempel fails to see that Bridgman’s nominal criterion of quantity individuation is not only opposed to the systematic aims of science but also blatantly circular. Like Bridgman, Hempel wrongly believed that agreement and disagreement among measuring instruments are adjudicated by a comparison of indications, or ‘readings’ (ibid, 92). The circularity of Bridgman’s criterion is exposed only once the focus shifts from instrument indications to measurement outcomes, which already incorporate error corrections. I will elaborate on this distinction below.
thermometric fluid used: a sample that is deemed warmer than another by one thermometer
will also be deemed warmer by the others. Similarly, two atomic clocks whose frequencies
are unstable relative to each other still preserve the order of events they are used to record,
barring relativistic effects.
Nevertheless, Ellis’ criterion fails to capture the case of intervals and ratios of
measurable quantities. Quantity intervals and quantity ratios are themselves quantities, and
feature prominently in natural laws. Indeed, Ellis himself mentions the measurement of
time-intervals and temperature-intervals47 and treats them as examples of quantities. As we
have seen, when genuine systematic errors occur, measurement procedures do not preserve
the order of intervals and ratios. Two temperature intervals deemed equal by one
thermometer are deemed unequal by another depending on the thermometric fluid used.
Note that this discrepancy persists far above the sensitivity thresholds of the instruments
and cannot be attributed to resolution limitations.
A similar situation occurs with clocks. Consider two clocks, one whose ‘ticks’ are
slowly increasing in frequency relative to a standard, the other slowly decreasing. Now
imagine that each of these clocks is used to measure the frequency of the standard.
Relativistic effects aside, the speeding clock will indicate that the time intervals marked by
the standard are slowly increasing while the slowing clock will indicate that they are
decreasing – a complete reversal of order of time intervals. Ellis’ criterion is therefore
insufficient to decide whether or not the two clocks measure intervals of the same quantity
(i.e. time.) Considered in light of this criterion alone, the clocks may just as well be deemed
to measure intervals of two different and anti-correlated quantities, time-A and time-B. But
47 Ibid, 44 and 100.
this is absurd, and again leaves open the possibility of unnecessary multiplication of
quantities and laws encountered in Bridgman’s case.
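The reversal described here can be checked with a toy calculation (my sketch, not from the text: the linear drift model and the drift rate are illustrative assumptions). Each clock ‘measures’ a standard one-second interval as the number of its own ticks falling within that interval:

```python
# Illustrative sketch: two drifting clocks measure the equal one-second
# intervals of a standard. Clock A's tick frequency slowly increases,
# clock B's slowly decreases. The drift rate EPS is a hypothetical value.

EPS = 0.01  # assumed drift rate (fraction of nominal frequency per second)

def ticks_in_interval(t_start, drift):
    """Ticks accumulated over the standard interval [t_start, t_start + 1]
    by a clock whose instantaneous frequency is 1 + drift * t
    (integrated analytically)."""
    t_end = t_start + 1.0
    return (t_end - t_start) + drift * (t_end**2 - t_start**2) / 2.0

# Ten successive, physically equal standard intervals:
fast_clock = [ticks_in_interval(t, +EPS) for t in range(10)]
slow_clock = [ticks_in_interval(t, -EPS) for t in range(10)]

# The speeding clock reports the standard's intervals as growing, the
# slowing clock as shrinking -- a complete reversal of interval order.
assert fast_clock == sorted(fast_clock)
assert slow_clock == sorted(slow_clock, reverse=True)
```

Run against ten successive standard intervals, the speeding clock’s interval estimates are monotonically increasing while the slowing clock’s are monotonically decreasing, even though the intervals marked by the standard are physically equal.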
As with Bridgman, Ellis cannot defend his criterion by claiming that it applies only to
ordering relationships that have already been appropriately corrected for systematic errors.
For as we have seen, such corrections can only be made in light of the presupposition that
the relevant procedures measure the same quantity, and this is the very claim Ellis’ criterion
is supposed to establish.
Ellis may retort by claiming that his criterion is intended to provide only necessary, but
not sufficient, conditions for quantity individuation. This would be of some consolation if
the condition specified by Ellis’ criterion – namely, the convergence of linear order – were
commonly fulfilled whenever scientists compare measuring instruments to each other. But
almost all comparisons among measuring instruments in the physical sciences are expressed
in terms of intervals or ratios of outcomes, and we saw that Ellis’ criterion is not generally
fulfilled for intervals and ratios. Moreover, virtually all known laws of physics are expressed
in terms of quantity intervals and ratios. The discovery and confirmation of nomic relations,
which are among the primary aims of physics, require individuation criteria that are
applicable to intervals and ratios of quantities, but these are not covered by Ellis’ criterion.
2.3.3. Representational Theory of Measurement
Perhaps the best known contemporary philosophical account of measurement is the
Representational Theory of Measurement (RTM)48. Unlike the two previous accounts, RTM
does not explicitly discuss the individuation of quantities. Nevertheless, RTM discusses at
length the individuation of types of measurement scales. A scale type is individuated by the
transformations it can undergo. For example, the Celsius and Fahrenheit scales belong to the
same type (‘interval’ scales) because they have the same set of permissible transformations,
i.e. linear transformations with an arbitrary zero point49. The set of permissible
transformations for a given scale is established by proving a ‘uniqueness theorem’ for that
scale, a proof that rests on axioms concerning empirical relations among the objects
measured on that scale50.
RTM can be used to generate an objection to my analysis of systematic error.
According to this objection, the discrepancies I call ‘genuine systematic errors’ are simply
cases where the same quantity is measured on different scales. For example, the
discrepancies between mercury and alcohol thermometers arise because these instruments
represent temperature on different scales (one may call them the ‘mercury temperature scale’
and ‘alcohol temperature scale’.) RTM shows that these scales belong to the same type –
namely, interval scales. Moreover, RTM supposedly provides us with a conversion factor
that transforms temperature estimates from one scale to the other, and this conversion
eliminates the discrepancies. ‘Genuine systematic errors’, according to this objection, are not
48 Krantz et al. (1971).
49 Ibid, 10.
50 Ibid.
errors at all but merely byproducts of a subtle scale difference. RTM eliminates these
byproducts before the underdetermination problem I mention has a chance to arise.
Like the proposals by Bridgman and Ellis, this objection is circular. It purports to
eliminate genuine systematic errors by appealing to differences in measurement scale, but
any test for identifying differences in measurement scale must already presuppose that
genuine systematic errors have been corrected.
This is best illustrated by considering a variant of the problem of quantity
individuation. As before, we may assume that scientists are faced with apparent
disagreement between the outcomes of different measurements. However, in this variant
scientists are entertaining four possible explanations instead of just three:
(H1) the instruments are not measuring the same quantity;
(H1S) measurement outcomes are represented on different scales;
(H2) systematic errors have been inappropriately evaluated; or
(H3) measurement uncertainty has been underestimated.
According to the objection, hypothesis (H1S) can be tested independently of the other
three hypotheses. In other words, facts about the appropriateness and uniqueness of a scale
employed in measurement can be tested independently of questions about what, and how
accurately, the instrument is measuring. This is yet another conceptual foundationalist claim,
i.e. the claim that the concept of measurement scale is fundamental and therefore has
universal criteria of existence and identity.
If taken literally, conceptual foundationalism about measurement scales leads to the
same absurd multiplication of quantities already encountered above. This is because genuine
systematic errors by definition cannot be transformed away through alterations of
measurement scale. In the thermometry case, for example, the nonlinear discrepancy
between mercury and alcohol thermometers cannot be eliminated by transformations of the
interval scale, as the latter only admits of linear transformations. One is forced to conclude
that the thermometers are measuring temperature on different types of scales – a ‘mercury
scale type’ and an ‘alcohol scale type’ – with no permissible transformation between them. But this
conclusion is inconsistent with RTM, according to which both scales are interval scales and
hence belong to the same type. How can temperature be measured on two different interval
scales without there being a permissible transformation between them? The only way to
avoid inconsistency is to admit that the so-called ‘thermometers’ are not measuring the same
quantity after all, but two different and nonlinearly related quantities. Hence strict
conceptual foundationalism about measurement scales leads to the same sort of
fragmentation of quantity concepts already familiar from Bridgman and Ellis’ accounts. If
RTM is interpreted along such strict empiricist lines, it can provide very little insight into the
way measurement scales are employed in successful cases of scientific practice.
A second and supposedly more charitable option is to interpret RTM as applying to
indications in the idealized sense, already taking into account error corrections. This is
compatible with the views expressed by authors of RTM, who state that their notion of
‘empirical structure’ should be understood as an idealized model of the data that already
abstracts away from biases51. But on this reading the objection becomes circular. RTM’s
proofs of uniqueness theorems, which according to the objection are supposed to make
51 See Luce and Suppes (2002, 2).
corrections to genuine systematic errors redundant, presuppose that these corrections have
already been applied.
Not only the objection, but RTM itself becomes circular under this reading. RTM,
recall, aims to provide necessary and sufficient conditions for the appropriateness and
uniqueness of measurement scales. According to the so-called ‘charitable’ reading just
discussed, these conditions are specified under the assumption that measurement errors have
already been corrected. In other words, any test of (H1S) can only be performed under the
assumption that (H2) and (H3) have already been rejected. But any test of (H2) or (H3) must
already represent measurement outcomes on some measurement scale, for otherwise
quantitative error correction and uncertainty evaluation are impossible. In other words, the
representational appropriateness of a scale type must already be presupposed in the process of
obtaining idealized empirical relations among measured objects.52 Consequently these
empirical relations cannot be used to test the representational appropriateness of the scale
type being used. Instead, (H1S) is epistemically entangled with (H2) and (H3) and ipso facto
with (H1). The project of establishing the appropriateness and uniqueness of measurement
scales based on nothing but observable evidence is caught in a vicious circle.
The so-called ‘charitable’ reading of RTM fails to be charitable enough because it
takes RTM to be an epistemological theory of measurement. Those who read RTM in this light
expect it to provide insight into the way claims to the appropriateness and uniqueness of
measurement scales may be tested by empirical evidence. The authors of RTM occasionally
52 This last point has also been noted by Mari (2000), who claims that “the [correct] characterization of measurement is intensional, being based on the knowledge available about the measurand before the accomplishment of the evaluation. Such a knowledge is independent of the availability of any extensional information on the relations in [the empirical relational structure] RE” (ibid, 74-5, emphases in the original).
make comments that encourage this expectation from their theory53. But this expectation is
unfounded. As we have just seen, the justification for one’s choice of measurement scale
cannot be abstracted away from considerations relating to the acquisition and correction of
empirical data. Any test for the appropriateness of scales that does not take into account
considerations of this sort is bound to be circular or otherwise multiply quantities
unnecessarily. Given that RTM remains silent on considerations relating to the acquisition
and processing of empirical evidence, it cannot be reasonably expected to function as an
epistemological theory of measurement.
Under a third, truly charitable reading, RTM is merely meant to elucidate the
mathematical presuppositions underlying measurement scales. It is not concerned with
grounding empirical knowledge claims but with the axiomatization of a part of the
mathematical apparatus employed in measurement. Stripped from its epistemological guise,
RTM avoids the problem of quantity individuation. But the cost is substantial: RTM can no
longer be considered a theory of measurement proper, for measurement is a knowledge-
producing activity, and RTM does not elucidate the structure of inferences involved in
making knowledge claims on the basis of measurement operations. In other words, RTM
explicates the presuppositions involved in choosing a measurement scale but not the
empirical criteria for the adequacy of these presuppositions. RTM’s role with respect to
measurement theory is therefore akin to that of axiomatic probability theory with respect to
53 For example, the authors of RTM seem to suggest that empirical evidence justifies or confirms the axioms: “One demand is for the axioms to have a direct and easily understood meaning in terms of empirical operations, so simple that either they are evidently empirically true on intuitive grounds or it is evident how systematically to test them.” (Krantz et al 1971, 25)
quantum mechanics: both accounts supply rigorous analyses of indispensable concepts (scale,
probability) but not the conditions of their empirical application.
To summarize this section, the foundational empiricist attempt to specify a test of
quantity individuation (or scale type individuation) in terms of nothing more than relations
among observable indications of measuring instruments fails. And fail it must, because
indications themselves are insufficient to determine whether instruments measure different
(but correlated) quantities or the same quantity with some inaccuracy. The next section will
outline a novel epistemology of measurement, one that rejects foundationalism and dissolves
the problem of quantity individuation.
2.4. A model-based account of measurement
2.4.1. General outline
According to the account I will now propose, physical measurement is the coherent
and consistent attribution of values to a quantity in an idealized model of a physical process.
Such models embody theoretical assumptions concerning relevant processes as well as
statistical assumptions concerning the data generated by these processes. The physical
process itself includes all actual interactions among measured samples, instrument, operators
and environment, but the models used to represent such processes neglect or simplify many
of these interactions. It is only in light of some idealized model of the measuring process
that measurement outcomes can be assessed for accuracy and meaningfully compared to
each other. Indeed, it is only against the background of such simplified and approximate
representation of the measuring process that measurement outcomes can even be considered
candidates for objective knowledge.
To appreciate this last point in full, it is useful to distinguish between the indications (or
‘readings’) of an instrument and the outcomes of measurements performed with that
instrument. This distinction has already been implicit in the discussion above, but the model-
based view makes it explicit. Examples of indications are the height of a mercury column in
a barometer, the position of a pointer relative to the dial of an ammeter, and the number of
cycles (‘ticks’) generated by a clock during a given sampling period. More generally, an
indication is a property of an instrument in its final state after the measuring process has been
completed. The indications of instruments do not constitute measurement outcomes, and in
themselves are no different than the final states of any other physical process54. What gives
indications special epistemic significance is the fact that they are used for inferring values of
a quantity based on a model of the measurement process, a model that relates possible
indications to possible values of a quantity of interest. These inferred estimates of quantity
values are measurement outcomes. Examples are estimates of atmospheric pressure, electric
current and duration inferred from the abovementioned indications. Measurement outcomes
are expressed on a determinate scale and include associated uncertainties, although
sometimes only implicitly.
A hallmark of the model-based approach to measurement is that models are viewed as
preconditions for obtaining an objective ordering relation among measured objects. We already
saw, for example, that the ordering of time intervals or temperature intervals obtained by
54 Indications may be divided into ‘raw’ and ‘processed’, the latter being numerical representations of the former. Neither processed nor raw indications constitute measurement outcomes. For further discussion see Chapter 4, Section 4.2.2.
operating a clock or a thermometer depends on how scientists represent the relationship
between indications and values of the quantity being measured. Such ordering is a
consequence of modeling the instrument in a particular way and assigning systematic
corrections to its indications accordingly. Contrary to empiricist theories of measurement,
then, the ordering of objects with respect to the quantity being measured is never simply
given through observation but must be inferred based on a model of the measuring process.
Prior to such model-based inference, the ‘raw’ ordering of objects by the indications of an
empirical operation is nothing more than a local regularity that may just as plausibly be
ascribed to an idiosyncrasy of the instrument, the environment or the human operator as to
the objects being ordered.
This last claim is not meant as a denial of the existence of theory-free operations for
ordering objects, e.g. placing pairs of objects on the pans of an equal-arms balance.
However, such operations on their own do not yet measure anything, nor is measurement
simply a matter of mapping the results of such operations onto numbers. Measurement
claims, recall, are claims to objective knowledge – meaning that order is ascribed to measured
objects rather than to artifacts of the specific operation being used. Grounding such a claim to
objectivity involves differentiating operation-specific features from those that are due to a
pertinent difference among measured samples. As we already saw, different procedures that
supposedly measure the same quantity often produce inconsistent, and in some cases even
completely reversed, ‘raw’ orderings among objects. Such orderings must therefore be
considered operation-specific and cannot be taken as measurement outcomes.
To obtain a measurement outcome from an indication, a distinction must be drawn
between pertinent aspects of the measured objects and procedural artifacts. This involves
the development of what is sometimes called a ‘theory of the instrument’, or more exactly an
idealized model of the measurement process, from theoretical and statistical assumptions.
Such models allow scientists to account for the effects of local idiosyncrasies and correct the
outcomes accordingly. Unlike the ‘raw’ order indicated by an operation, the order resulting
from a model-based inference has the proper epistemic credentials to ground objective
claims to measurement, because it is based on coherent assumptions about the object (or
process, or event) being measured.
Not every estimation of a quantity value in an idealized model of a physical process is a
measurement. Rather, a measurement is based on a model that coheres with background
theoretical assumptions, and is consistent with other measurements of the same or related
quantities performed under different conditions. As a result, what counts as an instance of
measurement may change when assumptions about relevant quantities, instruments or
modeling practices are modified.
All measurement outcomes are relative to an abstract and idealized representation of
the procedure by which they were obtained. This explains how the outcomes of a
measurement procedure can change without any physical modification to that procedure,
merely by changing the way the instrument is represented. Similarly, the model-based
approach explains how the accuracy of a measuring instrument can be improved merely by
adding correction terms to the model representing the instrument. Thirdly, the model-
relativity of measurement outcomes explains how the same set of operations, again without
physical change, can be used to measure different quantities on different occasions
depending on the interests of researchers. An example is the use of the same pendulum to
measure either duration or gravitational potential without any physical change to the
pendulum or to the procedures of its operation and observation. The change is effected
merely by a modification to the mathematical manipulation of quantities in the model. For
measuring duration, researchers plug in known values for gravitational potential in their
model of the pendulum and use the indications of the pendulum (i.e. number of swings) to
tell the time, whereas measuring gravitational potential involves the opposite mathematical
procedure.
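For concreteness, the pendulum case can be sketched with the textbook small-oscillation period formula (my illustration; the thesis itself does not state the formula, and ‘gravitational potential’ is here represented by the local gravitational acceleration g):

```latex
% Idealized model: small-oscillation period of a pendulum of length L
T = 2\pi \sqrt{L/g}
% Measuring duration: treat g (and L) as known; n counted swings give
t = nT = 2\pi n \sqrt{L/g}
% Measuring gravity: treat t as known (e.g. from a reference clock); then
g = 4\pi^2 L \left(\frac{n}{t}\right)^2
```

The indications – a count of swings – are unchanged in both cases; only the choice of which parameter in the model is treated as unknown differs.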
The notions of accuracy and error are similarly elucidated in relation to models. The
accuracy of a measurement procedure is determined by the accuracy of model-based predictions
regarding that procedure’s outcomes. That is, a measurement procedure is accurate relative
to a given model if and only if the model accurately predicts the outcomes of that procedure
under a given set of circumstances55. Similarly, measurement error is evaluated as the
discrepancy between such model-based predictions and standard values of the quantity in
question. Such errors include, but are not limited to, discrepancies that can be estimated by
statistical analysis of repeated measurements. In attributing claims concerning accuracy and
error to predictions about instruments, rather than directly to instruments themselves, the
model-based account makes explicit the inferential nature of accuracy and error (see also
Hon 2009.)56
55 The accuracy of model-based predictions is evaluated by propagating uncertainties from ‘input’ quantities to ‘output’ quantities in the model, as will be clarified in Chapter 4.
56 Commenting on Hertz’ 1883 cathode ray experiments, Hon writes: “The error we discern in Hertz’ experiment cannot be associated with the physical process itself […]. Rather, errors indicate claims to knowledge. An error reflects the existence of an argument into which the physical process of the experiment is cast.” (2009, 21)
2.4.2. Conceptual quantity individuation
According to the model-based approach, a physical quantity is a parameter in a theory
of a kind of physical system. Specifically, a measurable physical quantity is a theoretical
parameter whose values can be related in a predictable manner to the final states of one or
more physical processes. A measurable quantity is therefore defined by a background theory
(or theories), which in turn inform the construction of models of particular processes
intended to measure that quantity.
The model-based approach is not committed to a particular metaphysical standpoint
on the reality of quantities. Whether or not quantities correspond to mind-independent
properties is seen as irrelevant to the epistemology of measurement, that is, to an analysis of
the evidential conditions under which measurement claims are justified. This is not meant to
deny that scientists often think of the quantities they measure as representing mind-
independent properties and that this way of thinking is fruitful for the development of
accurate measurement procedures. But whether or not the quantities scientists end up
measuring in fact correspond to mind-independent properties makes no difference to the
kinds of tests scientists perform or the inferences they draw from evidence, for scientists
have no access to such putative mind-independent properties other than through empirical
evidence57. As will become clear below, the model-based approach allows one to talk
coherently about accuracy, error and objectivity as properties of measurement claims without
57 My agnosticism with respect to the existence of mind-independent properties does not, of course, imply agnosticism with respect to the existence of objects of knowledge and the properties they possess qua objects of knowledge. A column of mercury has volume insofar as it can be reliably perceived to occupy space. I therefore accept a modest form of epistemic (e.g. Kantian) realism.
committing to any particular metaphysical standpoint concerning the truth conditions of
such claims. The model-based approach does, however, make a distinction among quantities
in terms of their epistemic status. The epistemic status of physical quantities varies from
merely putative to deeply entrenched depending on the demonstrated degree of success in
measuring them. As mentioned, to successfully measure a quantity is to estimate its values in
a consistent and coherent manner based on models of physical processes.
The model-based view provides a straightforward account of quantity individuation
that dissolves the underdetermination problem discussed above. In order to individuate
quantities across measuring procedures, one has to determine whether the outcomes of
different procedures can be consistently modeled in terms of the same parameter in the background
theory. If the answer is ‘yes’, then these procedures measure the same quantity relative to those
models.
A few clarifications are in order. First, by ‘consistently modeled’ I mean that outcomes
of different procedures converge within the uncertainties predicted by their respective
models. A detailed example of this sort of test was discussed in Chapter 1. Second, the
phrase ‘same parameter in the background theory’ requires clarification. A precondition for
even testing whether two instruments provide consistent outcomes is that the outcomes of
each instrument are represented in terms of the same theoretical parameter. By ‘same
theoretical parameter’ I mean a parameter that enters into approximately the same relations
with other theoretical parameters.58 The requirement to model outcomes in terms of the
same theoretical quantity therefore amounts to a weak requirement for nomic coherence among
58 This definition is recursive, but as long as the model has a finite number of parameters the recursion bottoms out. A more general definition is required for models with infinitely many parameters.
models specified in terms of that quantity, rather than to a strong requirement for identity of
extension or intension among quantity terms59.
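The consistency test described here — convergence of outcomes within model-predicted uncertainties — can be stated as a simple numerical criterion. The sketch below assumes independent uncertainties combined in quadrature and a coverage factor k = 2; these choices, like the function name, are illustrative assumptions rather than anything mandated by the text.

```python
import math

def compatible(outcome1, u1, outcome2, u2, k=2.0):
    """Do two measurement outcomes converge within their respective
    model-predicted uncertainties?  Criterion (assumed):
    |x1 - x2| <= k * sqrt(u1^2 + u2^2)."""
    return abs(outcome1 - outcome2) <= k * math.hypot(u1, u2)
```

On this criterion, two procedures count as measuring the same quantity, relative to their models, only when such checks succeed across circumstances.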
The emphasis on theoretical models may raise worries as to the status of pre-
theoretical measurements. After all, measurements were performed long before the rise of
modern physics. However, even when a full-fledged theory of the measured quantity is
missing or untrustworthy, some pre-theoretical background assumptions are still necessary
for comparing the outcomes of measurements. When in the 1840s Regnault made his
comparisons among thermometers he eschewed all assumptions concerning the nature of
caloric and the conservation of heat (Chang 2004, 77) but he still had to presuppose that
temperature is a single-valued quantity, that it increases when an object is exposed to a heat
source, and that an increase of temperature under constant pressure is usually correlated
with expansion. These background assumptions informed the way Regnault modeled his
instruments. Indeed, independently of these minimal assumptions the claim that Regnault’s
instruments measured the same quantity cannot be tested.
To summarize the individuation criterion offered by the model-based approach, two
procedures measure the same quantity only relative to some way of modeling those
procedures, and if and only if their outcomes are shown to be modeled consistently and
coherently in terms of the same theoretical parameter.
It is now time to clarify how this criterion deals with the problem of individuation
outlined earlier in this chapter. As mentioned, the problem of quantity individuation has two
distinct consequences that raise different sorts of challenges: one conceptual and the other
practical. On the conceptual level it is only a problem for foundational accounts of
59 For a recent proposal to individuate quantity concepts in this way see Diez (2002, 25-9).
measurement, namely those that attempt to specify theory-free individuation criteria for
measurable quantities. The model-based approach dissolves the conceptual problem by
resisting the temptation to offer foundational criteria of quantity individuation. The identity
of quantities across measurement procedures is relative to background assumptions, either
theoretical or pre-theoretical, concerning what those procedures are meant to measure.
Genuinely discrepant thermometers, for example, measure the same quantity only relative to
a theory of temperature, or before such theory is available, relative to pre-theoretical beliefs
about temperature. Similarly, different cesium atomic clocks measure the same quantity only
relative to some theory of time such as Newtonian mechanics or general relativity, or
otherwise relative to some pre-theoretical conception of time.
Even relative to a given theory metrologists sometimes have a choice as to whether or
not they represent instruments as measuring the same quantity. Relative to general relativity,
for example, atomic clocks placed at different heights above sea level measure the same
coordinate time but different proper times. Quantity individuation therefore depends on which
of these two quantities the clocks are modeled as measuring. The choice among different
ways of modeling a given instrument involves a difference in the systematic correction
applied to its indications. The latter point has already been illustrated in the case of the
Michelson-Morley apparatus, but it holds even in more mundane cases that do not involve
theory change. To return to the clock example, cesium fountain clocks that are represented
as measuring proper time do not require correction for gravitational red-shifts. The
discrepancy among their results is attributed to the fact that they occupy different reference
frames and therefore measure different proper times relative to those frames. On the other
hand, when the same clocks are represented as measuring the same coordinate time on the
geoid (an imaginary surface of equal gravitational potential that roughly corresponds to the
earth’s sea level) their indications need to be corrected for a gravitational red-shift.
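The gravitational red-shift correction mentioned here has a simple weak-field form. The sketch below is a minimal illustration, assuming the standard approximation Δf/f ≈ g·h/c² for a clock at small height h above the geoid; the function name and the example height are mine.

```python
G = 9.80665        # m/s^2, nominal gravitational acceleration at the geoid
C = 299_792_458.0  # m/s, speed of light

def redshift_correction(height_above_geoid_m):
    """Fractional frequency correction applied to a clock's indications when
    it is re-modeled as realizing coordinate time on the geoid.
    Weak-field, small-height approximation: delta_f / f ~= g * h / c^2."""
    return G * height_above_geoid_m / C**2

# A clock 1000 m above the geoid runs fast by roughly 1.1e-13 in fractional
# frequency relative to a clock on the geoid.
shift = redshift_correction(1000.0)
```

Whether this correction is applied at all depends, as the text notes, on whether the clock is modeled as measuring proper time or coordinate time.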
As already noted, the distribution of systematic errors among measurement procedures
and the individuation of quantities measured by those procedures are but two sides of the
same epistemic coin. Which side of the coin the scientific community will focus on when
resolving the next discrepancy depends on the particular history of its theoretical and
practical development. This conclusion stands in direct opposition to foundational
approaches, which attempt to provide sufficient conditions for establishing the identity of
measurable quantities independently of any particular scientific theory or experiment. The
model-based approach, by contrast, treats criteria for the individuation of quantities as
already embedded in some theoretical and material setting from the start. Claims concerning
the individuation of quantities are underdetermined by the evidence only in principle, when
such claims are viewed ‘from nowhere.’ But to view such claims independently of their
particular theoretical and material context is to misunderstand how measurement produces
knowledge. Measurement outcomes are the results of model-based inferences, and owe their
objective validity to the idealizing assumptions that ground such inferences. In the absence
of such idealizations, there is no principled way of telling whether discrepancies should be
attributed to the objects being measured or to extrinsic factors.
The search for theory-free criteria of quantity individuation is therefore opposed to the
very supposition that measurement provides objective knowledge. Such foundational
pursuits sprout from a conflation between instrument indications, which constitute the
empirical evidence for making measurement claims, and measurement outcomes, which are
value estimates that constitute the content of these claims. Once the conflation is pointed out,
it becomes clear that the background assumptions involved in inferring outcomes from
indications play a necessary and legitimate role in grounding claims about quantity
individuation, whereas the ‘raw’ evidence alone cannot and should not be expected to do so.
2.4.3. Practical quantity individuation
In addition to dissolving the conceptual problem of quantity individuation, the model-
based approach to measurement also sheds light on possible solutions to the practical
problem of quantity individuation, a task that is beyond the purview of other philosophical
theories of measurement. The practical problem, recall, is that of selecting which of the three
hypotheses (H1) – (H3) above to accept when faced with genuinely discrepant measurement
outcomes. Laboratory scientists are habitually confronted with this sort of challenge,
especially if they work in the forefront of accurate measurement where existing standards
cannot settle the issue.
A common solution to the practical problem of quantity individuation is to accept only
(H3), the hypothesis that measurement uncertainty has been underestimated, and enhance
the stated uncertainties so as to achieve compatibility among results. This is equivalent to
extending uncertainty bounds (sometimes mistakenly called ‘error bars’) associated with
different outcomes until the outcomes are statistically compatible. It is common to use
formal measures of statistical compatibility such as the Birge ratio60 to assess the success of
adjustments to stated uncertainties. Agreement is restored either by re-evaluating type-B
uncertainties associated with measuring procedures, by modifying statistical models of noise,
60 See Birge (1932). Henrion & Fischhoff (1986, 792) provide a concise introduction to the Birge ratio.
or by increasing the stated uncertainty ad hoc based on ‘educated guesses’ as to which
procedures are less accurate. Regardless of the technique of adjustment, the disadvantage of
accepting only (H3) is that the increase in uncertainty required to recover agreement is
similar in magnitude to the discrepancy among the outcomes. If the discrepancy is large
relative to the initially stated uncertainty, this strategy results in a large increase of stated
uncertainties.
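The Birge ratio test and the (H3) strategy of inflating stated uncertainties can be sketched as follows. This is a schematic illustration under standard assumptions (weighted mean, reduced chi-squared); the function names are mine, and real adjustments, as the text notes, may instead target specific type-B components.

```python
import math

def birge_ratio(values, uncertainties):
    """Birge ratio: square root of the reduced chi-squared of the outcomes
    about their weighted mean.  R_B ~ 1 indicates statistical compatibility;
    R_B >> 1 suggests understated uncertainties."""
    weights = [1.0 / u**2 for u in uncertainties]
    mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    chi2 = sum(w * (x - mean) ** 2 for w, x in zip(weights, values))
    return math.sqrt(chi2 / (len(values) - 1))

def inflate(uncertainties, r_b):
    """Accepting only (H3): scale every stated uncertainty by R_B (if > 1)
    so that the adjusted outcome set becomes statistically compatible."""
    factor = max(r_b, 1.0)
    return [u * factor for u in uncertainties]
```

Note the disadvantage flagged in the text: the inflation factor is of the same order as the discrepancy, so a large discrepancy forces a large loss of stated precision.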
Another option is to accept only (H2), the hypothesis that a systematic bias influences
the outcomes of some of the measurements. In some cases such bias may be corrected by
physically controlling its source, e.g. by shielding against background effects. This strategy is
nevertheless limited by the fact that not all sources of systematic bias are controllable (such
as the presence of nonzero gravitational potential on earth) and that others can only be
controlled to a limited extent. Moreover, for older measurements the apparatus may no
longer be available and attempts to recreate the apparatus may not succeed in reproducing its
idiosyncrasies. For these reasons, systematic biases are often corrected only numerically, i.e.
by modifying the theoretical model of the instrument with a correction factor that reflects
the best estimate of the magnitude of the bias. Because accuracy is ascribable to
measurement outcomes rather than to instrument indications, a model-based correction that
modifies the outcome is a perfectly legitimate tool for enhancing accuracy, even if it has no
effect on the indications of the instrument.
A third strategy for handling the practical challenge of quantity individuation, one that
the model-based approach is especially useful in elucidating, involves accepting all three
hypotheses (H1), (H2) and (H3) – namely, accepting that the instruments (as initially
modeled) measure different quantities, that a systematic error is present that has not been
appropriately corrected and that measurement uncertainties have been underestimated.
Agreement is then restored by a method I call ‘unity through idealization’, a method that is
central to the work of metrologists because it restores agreement with a relatively small loss
of accuracy and without necessarily involving physical interventions.
The core idea behind this method is known as Galilean idealization (McMullin 1985).
Galileo’s famous measurements of free-fall acceleration were performed on objects rolling
down inclined planes. This replacement of the experimental object was made possible by an
idealization: acceleration on an inclined plane is an imperfect version of free-fall acceleration
in a vacuum. To measure free-fall acceleration, one does not have to experiment on a free-
falling object in a vacuum but merely to conceptually remove the effects of impediments
such as the plane, air resistance etc. from an abstract representation of the rolling object.
More generally, the principle of unity through idealization is this: the same quantity can be
measured in different concrete circumstances so long as these circumstances are represented
as approximations of the same ideal circumstances.
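Galileo's inference can be put in numerical form. The sketch below uses the simplest idealized model, a = g·sin(θ) for a frictionless, non-rotating body; a real rolling ball would also require its rotational inertia to be conceptually removed. The function name and sample values are illustrative.

```python
import math

def free_fall_g_from_incline(measured_accel, angle_deg):
    """'Unity through idealization': infer free-fall acceleration from an
    inclined-plane measurement by conceptually removing the plane.
    Idealized model (assumed): a = g * sin(theta), i.e. a frictionless,
    non-rotating body on the incline."""
    return measured_accel / math.sin(math.radians(angle_deg))
```

The measured accelerations on planes of different inclines differ, yet all are modeled as approximations of one and the same ideal quantity, g.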
This principle is utilized to restore agreement among seemingly divergent
measurement outcomes. For example, when cesium fountain clocks are found to
systematically disagree in their outcomes, it is occasionally possible to resolve the
discrepancy by further idealizing the theoretical models representing these clocks. The
discrepancy is attributed to the fact that the clocks were not measuring the same quantity in
the less idealized representation. For example, the clocks may be found to have been
measuring different frequencies of cesium, the difference being caused by the
presence of different levels of background thermal radiation. Instead of physically equalizing
the levels of background radiation across clocks, the clocks are conceptually re-modeled so
as to measure the ideal cesium frequency in the absence of thermal background, i.e. at a
temperature of absolute zero61. Under this new and further idealized representation of the
clocks, metrologists are justified in applying a correction factor to the model of each clock
that reflects their best estimate of the effect of thermal radiation on the indications of that
clock. This correction involves a type-B uncertainty that is added to the total uncertainty of
each clock, but this new uncertainty is typically much smaller than the discrepancy being
corrected. When successful, this strategy leads to the elimination of discrepancies with only a
small loss of accuracy, and with no physical modification to the apparatus.
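The thermal-radiation correction described here can be sketched numerically. The (T/300 K)⁴ scaling of the blackbody shift is standard, but the coefficient below is an order-of-magnitude illustration rather than an authoritative value, and the function names are mine.

```python
import math

def blackbody_correction(ambient_temp_k, coeff=-1.7e-14):
    """Fractional frequency correction that re-models a cesium clock as
    measuring the ideal transition frequency at absolute zero.
    The coefficient is illustrative of the published order of magnitude."""
    shift = coeff * (ambient_temp_k / 300.0) ** 4  # modeled blackbody shift
    return -shift  # the correction removes the modeled shift

def combined_uncertainty(u_statistical, u_type_b):
    """Add the correction's type-B uncertainty in quadrature, as described
    in the text."""
    return math.hypot(u_statistical, u_type_b)
```

The added type-B term is typically orders of magnitude smaller than the discrepancy it resolves, which is why this strategy restores agreement at only a small cost in accuracy.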
2.5. Conclusion: error as a conceptual tool
Philosophers of science have traditionally sought to analyze what they took to be basic
concepts of measurement independently of any particular scientific theory, experimental
tradition or instrument. This approach has proved fruitful for the axiomatization of
measurement scales, but as an approach to the epistemology of measurement, i.e. to the study
of the conditions under which measurement claims are justified in light of possible evidence,
conceptual foundationalism encounters severe limitations. This chapter was dedicated to the
discussion of one such limitation of conceptual foundationalism, namely to its attempt to
answer questions of the form ‘what does procedure P measure?’ independently of questions
of the form ‘how accurate is P?’ As I have shown, the two sorts of questions are
epistemically entangled, such that no empirical test can be devised that would answer one
61 For a detailed discussion of the modeling of cesium fountain clocks see Chapter 1.
without at the same time answering the other. Moreover, the choice of answers to both
questions depends on background theories and on traditions of interpreting evidence that
are accepted by the scientific community. Independently of such theories and traditions the
indications of measuring instruments are devoid of epistemic significance, i.e. cannot be
used to ground claims about the objects being measured.
The model-based approach offered here acknowledges the context-dependence of
measurement claims and dissolves the worries of underdetermination associated with
conceptual foundationalism. More importantly, the model-based approach clarifies how the
use of idealizations allows scientists to ground claims to the unity of quantity concepts. The
unity of quantity concepts across different measurement procedures rests on scientists’
success in consistently and coherently modeling these procedures in terms of the same
theoretical parameter. This treatment of quantity individuation clarifies several aspects of
physical measurement that have hitherto been neglected or poorly understood by
philosophers of science, most notably the notion of systematic error. Far from merely being
a technical concern for laboratory scientists, the possibility of systematic error is a central
conceptual tool in coordinating theory and experiment. Genuine systematic errors constitute
the conceptual ‘glue’ that allows scientists to model different instruments in terms of a single
quantity despite nonscalable discrepancies among their indications. The applicability of
quantity concepts across different domains, and hence the generality of physical theory, owe
their existence to the possibility of distributing systematic errors among the indications of
measuring instruments.
3. Making Time: A Study in the Epistemology of Standardization
Abstract: Contemporary timekeeping is an extremely successful standardization project, with most national time signals agreeing well within a microsecond. But a close look at methods of clock synchronization reveals a patchwork of ad hoc corrections, arbitrary rules and seemingly circular inferences. This chapter offers an account of standardization that makes sense of the stabilizing role of such mechanisms. According to the model-based account proposed here, to standardize a quantity is to legislate the proper mode of application of a quantity-concept to a collection of exemplary artifacts. This legislation is performed by specifying a hierarchy of models of these artifacts at different levels of abstraction. I show that this account overcomes limitations associated with conventionalist and constructivist explanations for the stability of networks of standards.
3.1. Introduction
The reproducibility of quantitative results in the physical sciences depends on the
availability of stable measurement standards. The maintenance, dissemination and
improvement of standards are central tasks in metrology, the science of reliable measurement.
With the guidance of the International Bureau of Weights and Measures (Bureau International
des Poids et Mesures or BIPM) near Paris, a network of metrological institutions around the
globe is responsible for the ongoing comparison and adjustment of standards.
Among the various standardization projects in which metrologists are engaged,
contemporary timekeeping is arguably the most successful, with the vast majority of national
time signals agreeing well within a microsecond and remaining stable to within a few nanoseconds a
month62. The standard measure of time currently used in almost every context of civil and
scientific life is known as Coordinated Universal Time or UTC63. UTC is the product of an
international cooperative effort by time centers that themselves rely on state-of-the-art
atomic clocks spread throughout the globe. These clocks are designed to measure the
frequencies associated with specific atomic transitions, including the cesium transition,
which has defined the second since 1967.
What accounts for the overwhelming stability of contemporary timekeeping standards?
Or, to phrase the question somewhat differently, what factors enable a variety of
standardization laboratories around the world to so closely reproduce Coordinated Universal
Time? The various explanans one could offer in response to this question may be divided
into two broad kinds. First, one could appeal to the natural stability, or regularity, of the
atomic clocks that contribute to world time. Second, one could appeal to the practices by
which metrological institutions synchronize these atomic clocks. The adequate combination
of these two sorts of explanans and the limits of their respective contribution to stability are
contested issues among philosophers and sociologists of science. This chapter will discuss
three accounts of standardization along with the explanations they offer for the stability of
62 Barring time zone and daylight saving adjustments. See BIPM (2011) for a sample comparison of national approximations to UTC.
63 UTC replaced Greenwich Mean Time as the global timekeeping reference in 1972. The acronym ‘UTC’ was chosen as a compromise to avoid favoring the order of initials in either English (CUT) or French (TUC).
UTC. Each account will assign different explanatory roles to the social and natural factors
involved in stabilizing timekeeping standards.
The first kind of explanation is inspired by conventionalism as expounded by Poincaré
([1898] 1958), Reichenbach ([1927] 1958) and Carnap ([1966] 1995). According to
conventionalists, metrologists are free to choose which natural processes they use to define
uniformity, namely, to define criteria of equality among time intervals. Prior to this choice,
which is in principle arbitrary, there is no fact of the matter as to which of two given clocks
‘ticks’ more uniformly. The choice of natural process (e.g. solar day, pendulum cycle, or
atomic transition) depends on considerations of convenience and simplicity in the
description of empirical data. Once a ‘coordinative definition’ of uniformity is given, the
truth or falsity of empirical claims to uniformity is completely fixed: how uniformly a given
clock ‘ticks’ relative to currently defined criteria is a matter of empirical fact. In Carnap’s
own words:
If we find that a certain number of periods of process P always match a certain number of periods of process P’, we say that the two periodicities are equivalent. It is a fact of nature that there is a very large class of periodic processes that are equivalent to each other in this sense. (Carnap [1966] 1995, 82-3, my emphasis)

We find that if we choose the pendulum as our basis of time, the resulting system of physical laws will be enormously simpler than if we choose my pulse beat. […] Once we make the choice, we can say that the process we have chosen is periodic in the strong sense. This is, of course, merely a matter of definition. But now the other processes that are equivalent to it are strongly periodic in a way that is not trivial, not merely a matter of definition. We make empirical tests and find by observation that they are strongly periodic in the sense that they exhibit great uniformity in their time intervals. (ibid, 84-5, my emphases)

Of course, some uncertainty is always involved in determining facts about uniformity
experimentally. But for a conventionalist this uncertainty arises solely from the limited
precision of measurement procedures and not from a lack of specificity in the definition.
Accordingly, the stability of contemporary timekeeping is explained by a combination of two
factors: on the social side, the worldwide agreement to define uniformity on the basis of the
frequency of the cesium transition; and on the natural side, the fact that all cesium atoms
under specified conditions have the same frequency associated with that particular transition.
The universality of the cesium transition frequency is, according to conventionalists, a mind-
independent empirical regularity that metrologists cannot influence but may only describe
more or less simply.
The second, constructivist sort of explanation affords standardization institutions
greater agency in the process of stabilization. Standardizing time is not simply a matter of
choosing which pre-existing natural regularity to exploit; rather, it is a matter of constructing
regularities from otherwise irregular instruments and human practices. Bruno Latour and
Simon Schaffer have expressed this position in the following ways:
Time is not universal; every day it is made slightly more so by the extension of an international network that ties together, through visible and tangible linkages, each of all the reference clocks of the world and then organizes secondary and tertiary chains of references all the way to this rather imprecise watch I have on my wrist. There is a continuous trail of readings, checklists, paper forms, telephone lines, that tie all the clocks together. As soon as you leave this trail, you start to be uncertain about what time it is, and the only way to regain certainty is to get in touch again with the metrological chains. (Latour 1987, 251, emphasis in the original)

Recent studies of the laboratory workplace have indicated that institutions’ local cultures are crucial for the emergence of facts, and instruments, from fragile experiments. […] But if facts depend so much on these local features, how do they work elsewhere? Practices must be distributed beyond the laboratory locale and the context of knowledge multiplied. Thus networks are constructed to distribute instruments and values which make the world fit for science. Metrology, the establishment of standard units for natural quantities, is the principal enterprise which allows the domination of this world. (Schaffer 1992, 23)
According to Latour and Schaffer, the metrological enterprise makes a part of the
noisy and irregular world outside of the laboratory “fit for science” by forcing it to replicate
an order otherwise exhibited only under controlled laboratory conditions. Metrologists
achieve this aim by extending networks of instruments throughout the globe along with
protocols for interpreting, adjusting and comparing these instruments. The fact, then, that
metrologists succeed in stabilizing their networks should not be taken as evidence for pre-
existing regularities in the operation of instruments. On the contrary, the stability of
metrological networks explains why scientists discover regularities outside the laboratory:
these regularities have already been incorporated into their measuring instruments in the
process of their standardization.
This chapter will argue that both conventionalist and constructivist accounts of
standardization offer only partial and unsatisfactory explanations for the stability of
networks of standards. These accounts focus too narrowly on either natural or social
explanans, but any comprehensive picture of stabilization must incorporate both. I will
propose a third, ‘model-based’ alternative to the conventionalist and constructivist views of
standardization, which combines the strengths of the first two accounts and explains how
both natural and social elements are mobilized through metrological practice.
This third approach views standardization as an ongoing activity aimed at legislating
the proper mode of application of a theoretical concept to certain exemplary artifacts. By
‘legislation’ I mean the specification of rules for deciding which concrete particulars fall
under a concept. In the case of timekeeping, metrologists legislate the proper mode of
application of the concept of uniformity of time to an ensemble of atomic clocks. That is,
metrologists specify algorithms for deciding which of the clocks in the ensemble
approximate the theoretical ideal of uniformity more closely. Contrary to the views of
conventionalists, this legislation is not a matter of arbitrary, one-time stipulation. Instead, I
will argue that legislation is an ongoing, empirically-informed activity. This activity is
required because theoretical definitions by themselves do not completely determine how the
defined concept is to be applied to particulars. Moreover, I will show that such acts of
legislation are partly constitutive of the regularities metrologists discover in the behavior of
their instruments. Which clocks count as ‘ticking’ more uniformly relative to each other
depends – though only partially – on how metrologists legislate the mode of application of
the concept of uniformity.
A crucial part of legislation is the construction of idealized models of measuring
instruments. As I will argue, legislation proceeds by constructing a hierarchy of idealized
models that mediate between the theoretical definition of the concept and concrete artifacts.
These models are iteratively modified in light of empirical data so as to maximize the
regularity with which concrete instruments are represented under the theoretical concept.
Additionally, instruments themselves are modified in light of the most recent models so as
to maximize regularity further. In this reciprocal exchange between abstract and concrete
modifications, regular behavior is iteratively imposed on the network ‘from above’ and
discovered ‘from below’, leaving genuine room for both natural and social explanans in an
account of stabilization. Acts of legislation are therefore conceived both as constitutive of
the regularities exhibited by instruments and as preconditions for the empirical discovery of
new regularities (or irregularities) in the behaviors of those instruments64.
The first of this chapter’s three sections presents the central methods and challenges
involved in contemporary timekeeping. The second section discusses the strengths and
64 In this respect the model-based account continues the analysis of measurement offered by Kuhn ([1961] 1977). Kuhn took scientific theories to be both constitutive of the correct application of measurement procedures and as preconditions for the discovery of anomalies. The model-based account extends Kuhn’s insights to the maintenance of metrological standards, where local models play a role analogous to theories in Kuhn’s account.
limits of conventionalist and constructivist explanations for the stability of metrological
networks, while the third and final section develops the model-based account of
standardization and demonstrates why it provides a more complete and satisfactory
explanation for the stability of UTC than the first two.
3.2. Making time universal
3.2.1. Stability and accuracy
The measurement of time relies predominantly on counting the periods of cyclical
processes, namely clocks. Until the late 1960s, time was standardized by recurrent
astronomical phenomena such as the apparent solar noon, and artificial clocks served only as
secondary standards. Contemporary time standardization relies on atomic clocks, i.e.
instruments that produce an electromagnetic signal that tracks the frequency of a particular
atomic resonance. The two central desiderata for a reliable clock are known in the
metrological jargon as frequency stability and frequency accuracy. The frequency of a clock
is said to be stable if it ticks at a uniform rate, that is, if its cycles mark equal time intervals.
The frequency of a clock is said to be accurate if it ticks at the desired rate, e.g. one cycle per
second.
Frequency stability is, in principle, sufficient for reproducible timekeeping. A collection
of clocks with perfectly stable frequencies would tick at constant rates relative to each other,
and so the readings of any such clock would be sufficient to reproduce the readings of any
of the others by simple linear conversion65. A collection of frequency-stable clocks is
therefore also ‘stable’ in the broader sense of the term, i.e. supports the reproducibility of
measurement outcomes. For this reason I will use the term ‘stability’ insofar as it pertains to
collections of clocks without distinguishing between its restricted (frequency-stability) and
broader (reproducibility) senses unless the context requires otherwise.
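The ‘simple linear conversion’ mentioned above can be made concrete with a short sketch. This is purely illustrative: the function name and the numbers are my own, and a real conversion would also need to handle the relativistic effects set aside in footnote 65.

```python
# Illustrative only: two ideally frequency-stable clocks differ by a
# constant rate ratio and a constant offset, so either clock's reading
# can be reproduced from the other's by a linear transformation.

def convert_reading(t_a, rate_ratio, offset):
    """Reproduce clock B's reading from clock A's reading t_a.

    rate_ratio: constant ratio of B's rate to A's rate (hypothetical)
    offset: B's reading at the moment A read zero (hypothetical)
    """
    return rate_ratio * t_a + offset

# Suppose clock B runs 1.00000002 times as fast as clock A and read
# 5.0 s when A was started; A now reads 1000.0 s.
reading_b = convert_reading(1000.0, 1.00000002, 5.0)  # about 1005.00002 s
```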
In practice, no clock has a perfectly stable frequency. The very notion of a stable
frequency is an idealized one, derived from the theoretical definition of the standard second.
Since 1967 the second has been defined as the duration of exactly 9,192,631,770 periods of
the radiation corresponding to a hyperfine transition of cesium-133 in the ground state66. As
far as the definition is concerned, the cesium atom in question is at rest at a temperature of
absolute zero, with no background fields influencing the energy associated with the transition.
Under these ideal conditions a cesium atom would constitute a perfectly stable clock. There
are several different ways to construct clocks that would approximate – or ‘realize’ – the
conditions specified by the definition. Different clock designs result in different trade-offs
between frequency accuracy, frequency stability and other desiderata, such as ease of
maintenance and ease of comparison.
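The definitional number lends itself to a quick worked calculation. The figure below is simple arithmetic from the 9,192,631,770-period definition quoted above, not an additional metrological datum:

```python
# Arithmetic implied by the 1967 definition: one second is exactly
# 9,192,631,770 periods of the cesium-133 hyperfine transition
# radiation, so each period lasts the reciprocal of that count.
F_CS = 9_192_631_770          # defined number of periods per second

period = 1 / F_CS             # duration of one cycle, in seconds
print(f"{period:.4e}")        # prints 1.0878e-10 (about 0.109 ns)
```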
Primary realizations of the second are designed for optimal accuracy, i.e. minimal
uncertainty with respect to the rate at which they ‘tick’. As of 2009, thirteen primary
realizations are maintained by leading national metrological laboratories worldwide67. These
clocks are special by virtue of the fact that every known influence on their output frequency
65 Barring relativistic effects.
66 BIPM (2006), 113
67 As of 2009, active primary frequency standards were maintained by laboratories in France, Germany, Italy, Japan, the UK, and the US (BIPM 2010, 33)
is controlled and rigorously modelled, resulting in detailed ‘uncertainty budgets.’ The clock
design implemented in most primary standards is the ‘cesium fountain’, so called because it
‘tosses’ cesium atoms up in a vacuum which then fall down due to gravity. This design
allows for a higher signal-to-noise ratio and therefore decreases measurement uncertainty.
The complexity of cesium fountains, however, and the need to routinely monitor their
performance and environment prevents them from running continuously. Instead, each
cesium fountain clock operates for a few weeks at a time, about five times a year. The
intermittent operation of cesium fountain clocks means that they cannot be used directly for
timekeeping. Instead, they are used to calibrate secondary standards, i.e. atomic clocks that are
less accurate but run continuously for years. About 350 such secondary standards are
employed to keep world time68. These clocks are highly stable in the short run, meaning that
the ratios between the frequencies of their ‘ticks’ remain very nearly constant over weeks and
months. But over longer periods the frequencies of secondary standards exhibit drifts, both
relative to each other and to the frequencies of primary standards.
Because neither primary nor secondary standards ‘tick’ at exactly the same rate,
metrologists are faced with a variety of real durations that can all be said to fit the definition
of the second with some degree of uncertainty. Metrologists are therefore faced with the
task of realizing the second based on indications from multiple, and often divergent, clocks.
In tackling this challenge, metrologists cannot simply appeal to the definition of the second
to tell them which clocks are more accurate as it is too idealized to serve as the basis for an
evaluation of concrete instruments. In Chapter 1 I called this the problem of multiple
68 Panfilo and Arias (2009)
realizability of unit definitions and discussed the way this problem is solved in the case of
primary frequency standards.
This chapter focuses on the ways metrologists solve the problem of multiple
realizability in the context of international timekeeping, where the goal is not merely to
produce a good approximation of the second but also to maintain an ongoing measure of
time and synchronize clocks worldwide in accordance with this measure. Timekeeping is an
elaborate task that extends well beyond the evaluation of a handful of carefully maintained
primary standards. It encompasses the global transmission of time signals that enable
coordination in every aspect of civil and scientific life. From communication satellites, to
financial exchanges, to the dating of astronomical observations, Coordinated Universal Time
is meant to guarantee that all of our clocks tell the same time, and it must manage to do so
despite the fact that every clock that maintains UTC ‘ticks’ with a slightly different ‘second’.
From the point of view of relativity theory, UTC is an approximation of terrestrial time,
a theoretically defined coordinate time scale on the earth’s surface69. Ideally, one can imagine
all of the atomic clocks that participate in the production of UTC as located on a rotating
surface of equal gravitational potential that approximates the earth’s sea level. Such a surface is
called a ‘geoid’, and terrestrial time is the time a perfectly stable clock on that surface would
tell when viewed by a distant observer. However, much like the definition of the second, the
definition of terrestrial time is highly idealized and does not specify the desired properties of
any concrete clock ensemble. Here again, metrologists cannot determine how well UTC
69 More exactly, it is International Atomic Time (TAI), identical to UTC except for leap seconds, that constitutes a realization of Terrestrial Time.
approximates terrestrial time based merely on the latter’s definition, and must compare UTC
to other realizations of terrestrial time.
3.2.2. A plethora of clocks
Let us now turn to the method by which metrologists create a universal measure of
time. At the BIPM near Paris, indications from around 350 secondary standards held at over
sixty national laboratories are processed. The BIPM receives a reading from each clock every five
days and uses these indications to produce UTC. Coordinated Universal Time is a measure
of time whose scale interval is intended to remain as close as is practically possible to a
standard second. Yet UTC is not a clock; it does not actually ‘tick’, and cannot be
continuously read off the display of any instrument. Instead, UTC is an abstract measure of
time: a set of numbers calculated monthly in retrospect, based on the readings of
participating clocks70. These numbers indicate how late or early each nation’s ‘master time’,
its local approximation of UTC, has been running in the past month. Typically ranging from
a few nanoseconds to a few microseconds, these numbers allow national metrological
institutes to then tune their clocks to internationally accepted time. Table 3.1 is an excerpt
from the monthly publication issued by the BIPM in which deviations from UTC are
reported for each national laboratory.
70 There are many clocks that approximate UTC, of course. As will be mentioned below, the BIPM and national laboratories produce continuous time signals that are considered realizations of UTC. However, UTC itself is an abstract measure and should not be confused with its many realizations.
Table 3.1: Excerpt from Circular-T (BIPM 2011), a monthly report through which the International Bureau of Weights and Measures disseminates Coordinated Universal Time (UTC) to national standardization institutes. The numbers in the first seven columns indicate differences in nanoseconds between UTC and each of its local approximations. The last three columns indicate type-A, type-B and total uncertainties for each comparison. (Only data associated with the first twenty laboratories is shown.)
In calculating UTC, metrologists face multiple challenges. First, almost none of the
clocks contributing to UTC are primary standards. As previously mentioned, most
primary standards do not run continuously. Consequently, UTC is maintained by a free-
running ensemble of secondary standards – stable atomic clocks that run continuously for
years but undergo less rigorous uncertainty evaluations than primary standards. Today the
majority of these clocks are commercially manufactured by Hewlett-Packard or one of its
offshoot companies, Agilent and Symmetricom. These clocks have proven to be
exceptionally stable relative to each other, and the number of HP clocks that participate in
UTC has been steadily increasing since their introduction into world timekeeping in the early
1990s. As of 2010 HP clocks constitute over 70 percent of contributing clocks71.
Comparing clocks in different locations around the globe requires a reliable method of
fixing the interval of comparison. This is another major challenge to globalising time. Were
the clocks located in the same room, they could be connected by optical fibres to a counter
that would indicate the difference, in nanoseconds, among their readings every five days.
Over large distances, time signals are transmitted via satellite. In most cases Global
Positioning System (GPS) satellites are used, thereby ‘linking’ the readings of participating
clocks to GPS time. But satellite transmissions are subject to delays, which fluctuate
depending on atmospheric conditions. Moreover, GPS time is itself a relatively unstable
derivative of UTC. These factors introduce uncertainties to clock comparison data known as
time transfer noise. Transfer noise, which increases with a laboratory’s distance from Paris, is often much
71 Petit (2004, 208), BIPM (2010, 52-67). A smaller portion of continuously-running clocks are hydrogen masers, i.e. atomic clocks that probe a transition in hydrogen rather than in cesium.
larger than the local instabilities of contributing clocks. This means that the stability of UTC
is in effect limited by satellite transmission quality.
3.2.3. Bootstrapping reliability
The first step in calculating UTC involves processing data from hundreds of
continually operating atomic clocks and producing a free-running time scale, EAL (Échelle
Atomique Libre). EAL is an average of clock indications weighted by clock stability. Finding
out which clocks are more stable than others requires some higher standard of stability
against which clocks would be compared, but arriving at such a standard is the very goal of
the calculation. For this reason EAL itself is used as the standard of stability for the clocks
contributing to it. Every month, the BIPM updates the weight of each clock according to how
well its readings predicted the weighted average of the EAL clock ensemble over the past twelve months.
The updated weight is then used to average clock data in the next cycle of calculation. This
method promotes clocks that are stable relative to each other, while clocks whose stability
relative to the overall average falls below a fixed threshold are given a weight of zero, i.e.
removed from that month’s calculation. The average is then recalculated based on the
remaining clocks. The process of removing offending clocks and recalculating is repeated
exactly four times in each monthly cycle of calculation72.
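The reweighting loop just described can be sketched in a few lines. This is a toy reconstruction, not the BIPM’s actual code: the function name, the simple deviation threshold, and the sample numbers are my assumptions, and the real weights are derived from twelve months of prediction performance rather than a single comparison.

```python
# Toy sketch (my reconstruction, not BIPM code): clocks whose readings
# deviate from the weighted average by more than a threshold receive a
# weight of zero, and the average is recalculated; the removal step is
# repeated four times in each monthly cycle, as described in the text.

def weighted_scale(readings, weights, threshold, rounds=4):
    """readings, weights: dicts keyed by clock id.

    Assumes at least one clock always survives the threshold test."""
    w = dict(weights)
    avg = None
    for _ in range(rounds):
        total = sum(w.values())
        avg = sum(w[c] * r for c, r in readings.items()) / total
        for c, r in readings.items():
            if abs(r - avg) > threshold:
                w[c] = 0.0  # drop the 'offending' clock for this month
    return avg, w

# Hypothetical readings in arbitrary units; clock 'c' is an outlier.
avg, w = weighted_scale({'a': 0.0, 'b': 0.2, 'c': 9.0},
                        {'a': 1.0, 'b': 1.0, 'c': 1.0}, threshold=4.0)
```

Note that once a clock is zero-weighted it stays excluded for the remainder of the monthly calculation, which is why the loop only ever lowers weights.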
Though effective in weeding out ‘noisy’ clocks, the weight updating algorithm
introduces new perils to the stability of world time. First, there is the danger of a positive
72 Audoin and Guinot 2001, 249.
feedback effect, i.e. a case in which a few clocks become increasingly influential in the
calculation simply because they have been dominant in the past. In this scenario, EAL would
become tied to the idiosyncrasies of a handful of clocks, thereby increasing the likelihood
that the remaining clocks would drift further away from EAL. For this reason, the BIPM
limits the weight allowed to any clock to a maximum of about 0.7 percent73. The method of
fixing this maximum weight is itself occasionally modified to optimize stability.
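The cap itself is simple arithmetic, and it reconciles the two figures given here: the text’s ‘about 0.7 percent’ and footnote 73’s 2.5/N rule agree once one plugs in roughly 350 contributing clocks.

```python
# Arithmetic check: footnote 73 caps each clock's weight at 2.5/N.
# With roughly 350 contributing clocks this comes to about 0.7
# percent, the figure quoted in the text.
n_clocks = 350
max_weight = 2.5 / n_clocks
print(f"{max_weight:.2%}")    # prints 0.71%
```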
Other than positive feedback, another source of potential instability is the abruptness
with which new clock weights are modified every month. Because different clocks ‘tick’ at
slightly different rates, a sudden change in weights results in a sudden change of frequency.
To avoid frequency jumps, the BIPM adds ‘cushion’ terms to the weighted average based on
a prediction of that month’s jump74. A third precautionary measure taken by the BIPM
assigns a zero weight to new clocks for a four-month test interval before authorizing them to
exert influence on international time.
The results of averaging depend not only on the choice of clock manufacturer,
transmission method and averaging algorithm, but also on the selection of particular
participating clocks. Only laboratories in nations among the eighty members and associates
of BIPM are eligible for participation in the determination of EAL. Funded by membership
fees, the BIPM aims to balance the threshold requirements of metrological quality with the
financial benefits of inclusiveness. Membership requires national diplomatic relations with
France, the depositary of the intergovernmental treaty known as the Metre Convention
(Convention du Mètre). This treaty authorizes BIPM to standardize industrial and scientific
73 Since 2002, the maximal weight of each clock is limited to 2.5 / N, where N is the number of contributing clocks (Petit 2004, 308).
74 Audoin and Guinot 2001, 243-5.
measurement. The BIPM encourages participation in the Metre Convention by highlighting
the advantages of recognized metrological competence in the domain of global trade, and by
offering reduced fees to smaller states and developing countries75. Economic trends and
political considerations thus influence which countries contribute to world time, and
indirectly which atomic clocks are included in the calculation of UTC.
3.2.4. Divergent standards
Despite the multiple means employed to stabilize the weighted average of clock
readings, additional steps are necessary to guarantee stability, due to the fact that the
frequencies of continuously operating clocks tend to drift away from those of primary
standards. In the late 1950s, when atomic time scales were first calculated, they were based
solely on free-running clocks. Over the course of the following two decades, technological
advances revealed that universal time was running too fast: the primary standards that
realized the second were beating slightly slower than the clocks that kept time. To align the
two frequencies, in 1977 the second of UTC was artificially lengthened by one part in 10^13.
At this time it was decided that the BIPM would make regular small corrections that would
‘steer’ the atomic second toward its officially realized duration, in an attempt to avoid future
shocks76. This decision effectively split atomic time into two separate scales, each ‘ticking’
with a slightly different second: on the one hand, the weighted average of free-running
75 Quinn (2003)
76 Audoin and Guinot 2001, 250
clocks (EAL), and on the other the continually corrected (or ‘steered’) International Atomic
Time, TAI (Temps Atomique International).
The monthly calculation of steering corrections is a remarkable algorithmic feat,
relying upon intermittent calibrations against the world’s ten cesium fountains. These
calibrations differ significantly from one another in quality and duration. Some primary
standards run for longer periods than others, resulting in a better signal; some calibrations
suffer from higher transfer noise; and some of the primary standards involved are more
accurate than others77. For this reason the BIPM assigns weights, or ‘filters’, to each
calibration episode depending on its quality. These checks are still not sufficient. Primary
standards do not agree with one another completely, giving rise to the concern that the
duration of the UTC second could fluctuate depending on which primary standard
contributed the latest calibration. To circumvent this, the steering algorithm is endowed with
‘memory’, i.e. it extrapolates data from past calibration episodes into times in which primary
standards are offline. This extrapolation must itself be time-dependent, as noise limits the
capacity of free-running clocks to ‘remember’ the frequency to which they were calibrated.
The BIPM therefore constructs statistical models for the relevant noise factors and uses
them to derive a temporal coefficient, which is then incorporated into the calculation of
‘filters’78.
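The role of ‘memory’ in the steering calculation can be illustrated with a toy filter. The exponential decay used here is my assumption, chosen only to show the shape of the idea; the BIPM derives its temporal coefficient from detailed statistical noise models, as the text explains.

```python
import math

# Toy 'memory' filter (my illustration, not the BIPM algorithm): past
# calibration episodes are combined with weights that decay over time,
# because noise erodes the ensemble's ability to 'remember' the
# frequency to which it was calibrated.

def filtered_frequency(calibrations, now, tau):
    """calibrations: (time, measured fractional frequency offset) pairs.
    tau: assumed decay constant standing in for the noise-model-derived
    temporal coefficient mentioned in the text."""
    num = den = 0.0
    for t, f in calibrations:
        weight = math.exp(-(now - t) / tau)  # older episodes count less
        num += weight * f
        den += weight
    return num / den

# Two hypothetical calibrations; the recent one dominates the estimate.
est = filtered_frequency([(0.0, 1.0e-15), (90.0, 3.0e-15)], 100.0, 30.0)
```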
This steering algorithm allows metrologists to track the difference in frequency
between free-running clocks and primary standards. Ideally, the difference in frequency
would remain stable, i.e. there would be a constant ratio between the ‘seconds’ of the two
77 See Chapter 1 for a detailed discussion of how the accuracy of primary standards is evaluated.
78 Azoubib et al (1977), Arias and Petit (2005)
measures. In this ideal case, requirements for both accuracy and stability would be fulfilled,
and a simple linear transformation of EAL would provide metrologists with a continuous
timescale as accurate as a cesium fountain. In practice, however, EAL continues to drift. Its
second has lengthened in the past decade by a yearly average of 4 parts in 10^16 relative to
primary standards79. This presents metrologists with a twofold problem: first, they have to
decide how fast they want to ‘steer’ world time away from the drifting average. Overly
aggressive steering would destabilize UTC, while too small a correction would cause clocks
the world over to slowly diverge from the official (primary) second. Indeed, the BIPM has
made several modifications to its steering policy in the past three decades in an attempt to
optimize both smoothness and accuracy80. The second aspect of the problem is the need to
stabilize the frequency of EAL. One solution to this aspect of the problem is to replace
clocks in the ensemble with others that ‘drift’ to a lesser extent. This task has largely been
accomplished in the past two decades with the proliferation of HP clocks, but some
instability remains. Elimination or reduction of the remaining instability is likely to require
new algorithmic ‘tricks’. The BIPM is currently considering a change to the EAL weighting
method that would involve a more sophisticated prediction of the behaviour of clocks, a
change that is expected to further reduce frequency drifts81.
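To give the quoted drift figure some intuitive scale, here is back-of-the-envelope arithmetic, treating the drift as if it were a constant fractional frequency offset sustained for one year (a simplification of the gradually lengthening second described above):

```python
# A constant fractional frequency offset y accumulates a time error of
# y * T over an interval T. At 4 parts in 10^16 sustained for a year,
# the uncorrected error would be on the order of a dozen nanoseconds.
SECONDS_PER_YEAR = 365.25 * 86400
y = 4e-16
error_ns = y * SECONDS_PER_YEAR * 1e9   # convert seconds to nanoseconds
print(f"{error_ns:.1f} ns")             # prints 12.6 ns
```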
Disagreements among standards are not the sole condition requiring frequency
steering. Abrupt changes in the ‘official’ duration of the second as realized by primary
standards may also trigger steering corrections. These abrupt changes can occur when
metrologists modify the way in which they model their instruments. For example, in 1996
79 Panfilo and Arias (2009)
80 Audoin and Guinot 2001, 251
81 Panfilo and Arias (2009)
the metrological community achieved consensus around the effects of thermal background
radiation on cesium fountains, previously a much debated topic. A new systematic
correction was subsequently applied to primary standards that shortened the second by
approximately 2 parts in 10^14. While this difference may seem minute, it took more than a
year of monthly steering corrections for UTC to ‘catch up’ with the suddenly shortened
second82.
3.2.5. The leap second
With the calculation of TAI the task of realizing the definition of the standard second
is complete. TAI is considered to be a realization of terrestrial time, that is, an
approximation of general-relativistic coordinate time on the earth’s sea level. However, a
third and last step is required to keep UTC in step with traditional time as measured by the
duration of the solar day. The mean solar day is slowly increasing in duration relative to
atomic time due to gravitational interaction between the earth and the moon. To keep ‘noon
UTC’ closely aligned with the apparent passage of the sun over the Greenwich meridian, a
leap second is occasionally added to UTC based on astronomical observations. By contrast,
TAI remains free of the constraint to match astronomical phenomena, and runs ahead of
UTC by an integer number of seconds83.
82 Audoin and Guinot 2001, 251
83 In January 2009 the difference between TAI and UTC was 34 seconds.
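The TAI–UTC relation is simple enough to state as a one-line conversion. The function below is my illustration; the 34-second figure comes from footnote 83 and changes whenever a new leap second is announced.

```python
# Minimal sketch: TAI and UTC differ by an integer number of leap
# seconds (34 s as of January 2009, per footnote 83), TAI being ahead.

def tai_to_utc(tai_seconds, leap_offset=34):
    """Convert a TAI reading to UTC for a given leap-second offset."""
    return tai_seconds - leap_offset

utc = tai_to_utc(1_000_000_034.0)  # returns 1000000000.0
```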
3.3. The two faces of stability
3.3.1. An explanatory challenge
The global synchronization of clocks in accordance with atomic time is a remarkable
technological feat. Coordinated Universal Time is disseminated to all corners of civil life,
from commerce and aviation to telecommunication, in a manner that is seamless to the vast
majority of its users. This achievement is better appreciated when one contrasts it to the
state of time coordination less than a century-and-a-half ago, when the transmission of time
signals by telegraphic cables first became available. Peter Galison (2003) provides a detailed
history of the efforts involved in extending a unified ‘geography of simultaneity’ across the
globe during the 1870s and 1880s, when railroad companies, national observatories, and
municipalities kept separate and conflicting timescales. Today, the magnitude of
discrepancies among timekeeping standards is far smaller than the accuracy required by
almost all practical applications, with the exception of a few highly precise astronomical
measurements.
The task of the remainder of this chapter is to explain how metrologists succeed in
synchronizing clocks worldwide to Coordinated Universal Time. What are the sources of
this measure’s efficacy in maintaining global consensus among time centers? An adequate
answer must account for the way in which the various ingredients that make up UTC
contribute to its success. In particular, the function of ad hoc corrections, rules of thumb and
seemingly circular inferences prevalent in the production of UTC requires explanation. What
role do these mechanisms play in stabilizing UTC, and is their use justified from an
epistemic point of view? The importance of this question extends beyond the measurement
of time. Answering it will require an account of the goals of standardization projects, the
sort of knowledge such projects produce, and the reasons they succeed or fail. I will begin by
considering two such accounts, namely conventionalism and constructivism, and argue that
they provide only partial and unsatisfactory explanations for the stability of contemporary
timekeeping standards. I will follow this by combining elements of both accounts in the
development of a third, model-based account of standardization that overcomes the
explanatory limitations of the first two.
3.3.2. Conventionalist explanations
Any plausible account of metrological knowledge must attend to the fact that
metrologists enjoy some freedom in determining the correct application of the concepts they
standardize. In order to properly understand the goals of standardization projects one must
first clarify the sources and scope of this freedom. Traditionally, philosophers of science
have taken standardization to consist in arbitrary acts of definition. Conventionalists like
Poincaré and Reichenbach stressed the arbitrary nature of the choice of congruence
conditions, that is, the conditions under which magnitudes of certain quantities such as
length and duration are deemed equal to one another. In his essay on “The Measure of
Time” ([1898] 1958), Poincaré argued against the existence of a mind-independent criterion
of equality among time intervals. Instead, he claimed that the choice of a standard measure
of time is “the fruit of an unconscious opportunism” that leads scientists to select the
simplest system of laws (ibid, 36). Reichenbach called these arbitrary choices of congruence
conditions ‘coordinative definitions’ because they coordinate between the abstract concepts
employed by a theory and the physical relations represented by these concepts (Reichenbach
1927, 14). In the case of time, the choice of congruence conditions amounts to a
coordinative definition of uniformity in the flow of time. Coordinative definitions are required
because theories by themselves do not specify the application conditions for the concepts
they define. A theory can only link concepts to one another, e.g. postulate that the concept
of uniformity of time is tied to the concept of uniform motion, but it cannot tell us which
real motions or frequencies count as uniform. For this, Reichenbach claimed, a coordinative
definition is needed that would link the abstract concept of uniformity with some concrete
method of time measurement. Prior to such coordinative definition there is no fact of the
matter as to whether or not two given time intervals are equal (ibid, 116).
The standardization of time, according to classical conventionalists, involves a free
choice of a coordinative definition for uniformity. It is worth highlighting three features of
this definitional sort of freedom as conceived by classical conventionalists. First, it is an a
priori freedom in the sense that its exercise is independent of experience. One may choose
any uniformity criterion as long as the consequences of that criterion do not contradict one
another. Second, it is a freedom only in principle and not in practice. For pragmatic reasons,
scientists select uniformity criteria that make their descriptions of nature as simple as
possible. The actual selection of coordinative definition is therefore strongly, if not uniquely,
constrained by the results of empirical procedures. Third, definitional freedom is singular in
the sense that it is completely exhausted by a single act of exercising it. Though a definition
can be replaced by another, each such replacement annuls the previous definition. In this
respect acts of definition are essentially ahistorical.
In the case of contemporary timekeeping, the definition of the second functions as a
coordinative definition of uniformity. Recall that the definition of the second specifies that
the period associated with a particular transition of the cesium atom is constant, namely, that
the cycles of the electromagnetic radiation associated with this transition are equal to each
other in duration. The definition of the second, in other words, fixes not only a unit of time
but also a criterion for the congruence of time intervals. In order to make this uniformity
criterion consistent across different relativistic reference frames, the cesium atom is said to
lie on the earth’s approximate sea level. The resulting coordinate timescale, terrestrial time,
provides a universal definition of uniformity while conveniently allowing earth-bound clocks
to approximate it.
According to conventionalists, once a coordinative definition of uniformity is chosen
the equality or inequality of durations is a matter of empirical fact. As the passage quoted
above from Carnap makes clear, the remaining task for metrologists is only to discover which
clocks ‘tick’ at a more stable rate relative to the chosen definition of uniformity and to
improve those clocks that were found to be less stable. Conventionalists, in other words,
explain the stability of networks of standards in naturalistic terms. A naturalistic explanation
for the stability of a network of standards is one that ultimately appeals to an underlying
natural regularity in the properties or behaviors of those standards. In the case of time
measurement, a conventionalist would claim that standardization is successful because the
operation of atomic clocks relies on an empirical regularity, namely the fact that the
frequency associated with the relevant transition is roughly the same for all cesium-133
atoms. This regularity may be described in ways that are more or less simple depending on
one’s choice of coordinative definition, but the empirical facts underlying it are independent
of human choice. Accordingly, a conventionalist explanation for the success of the
stabilizing mechanisms employed in the calculation of UTC is that these mechanisms make
UTC a reliable indicator of an underlying regularity, namely the constancy of the frequency
associated with different concrete cesium atoms used by different clocks84. Supposedly,
metrologists are successful in synchronizing clocks to UTC because the algorithm that
calculates UTC detects those clocks that ‘tick’ closer to the ideal cesium frequency and
distributes time adjustments accordingly.
The idea that UTC is a reliable indicator of a natural regularity gains credence from the
fact that UTC is gradually ‘steered’ towards the frequency of primary standards. As
previously mentioned, primary frequency standards are rigorously evaluated for uncertainties
and compared to each other in light of these evaluations. The fact that the frequencies of
different primary standards are consistent with each other within uncertainty bounds can be
taken as an indication for the regularity of the cesium frequency. Assuming, as metrologists
do85, that the long-term stability of UTC over years is due mostly to ‘steering’, one can
plausibly make the case that the algorithm that produces UTC is a reliable detector of a
natural regularity in the behavior of cesium atoms.
This nevertheless leaves unexplained the success of the mechanisms that keep UTC
stable in the short-term, i.e. when UTC is averaged over weeks and months. These
mechanisms include, among others, the ongoing redistribution of clock weights, the limiting
of maximum weight, the ‘slicing’ of steering corrections into small monthly increments and
the increasingly exclusive reliance on Hewlett-Packard clocks.
One way of accounting for these short-term stabilizing mechanisms is to treat them as
tools for facilitating consensus among metrological institutions. I will discuss this approach
84 This is a slight over-simplification, because not all the clocks that contribute to UTC are cesium clocks. As mentioned, some are hydrogen masers. The ‘regularity’ in question can therefore be taken more generally to be the constancy of frequency associated with any given atomic transition in some predefined set.
85 Audoin and Guinot 2001, 251
in the next subsection. Another option would be to look for a genuine epistemic function
that these mechanisms serve. To a conventionalist (as to any other naturalist), this means
finding a way of vindicating these self-stabilizing mechanisms as reliable indicators of an
underlying natural regularity. Because a reliable indicator is one that is sensitive to the
property being indicated, one should expect the relevant stabilizing mechanisms to do less
well when such regularity is not strongly supported by the data. In practice, however, no such
degradation in stability occurs. On the contrary, short-term stabilization mechanisms are
designed to be as insensitive to frequency drifts or gaps in the data as is practically possible.
It is rather the data that is continually adjusted to stabilize the outcome of the calculation. As
already mentioned, whenever a discrepancy among the frequencies of different secondary
standards persists for too long it is eliminated ad hoc, either by ignoring individual clocks or
by eventually replacing them with others that are more favorable to the stability of the
average. Frequency ‘shocks’ introduced by new clocks are numerically cushioned. Even
corrections towards primary standards, which are supposed to increase accuracy, are spread
over a long period by slicing them into incremental steering adjustments or by embedding
them in a ‘memory-based’ calculation.
The constancy of the cesium period in the short-term is therefore not tested by the
algorithm that produces UTC. For a test implies the possibility of failure, whereas the
stabilizing mechanisms employed by the BIPM in the short-term are fail-safe and intended
to guard UTC against instabilities in the data. Indeed, there is no sign that metrologists even
attempt to test the ‘goodness of fit’ of UTC to the individual data points that serve as the
input for the calculation, let alone that they are prepared to reject UTC if it does not fit the
data well enough. Rather than a hypothesis to be tested, the stability of the cesium period is
a presupposition that is written into the calculation from the beginning and imposed on the
data that serves as its input. This seemingly question-begging practice of data analysis
suggests either that metrological methods are fundamentally flawed or that the
conventionalist explanation overlooks some important aspect of the way UTC is supposed
to function. In Section 3.4 I will argue that the latter is the case, and that the seeming
circularity in the calculation of UTC dissolves once the normative role of models in
metrology is acknowledged.
3.3.3. Constructivist explanations
As we learned previously, UTC owes its short-term stability not to the detection of
regularities in underlying clock data, but rather to the imposition of a preconceived regularity
on that data. This regularity, i.e. the frequency stability of participating clocks relative to
UTC, is imposed on the data through weighting adjustments, time steps and frequency
corrections implemented in the various stages of calculation. Constructivist explanations for
the success of standardization projects make such regulatory practices their central
explanans. According to Latour and Schaffer (quoted above), the stability of global
timekeeping is explained by the ongoing efforts of metrological institutions to harness clocks
into synchronicity. Particularly, standard clocks agree about the time because metrologists
maintain a stable consensus as to which clocks to use and how the readings of these clocks
should be corrected. The stability of consensus is in turn explained by international
bureaucratic cooperation among standardization institutes. To use Latour’s language, the
stability of the network of clocks depends on an ongoing flux of paper forms issued by a
network of calculation centers. When we look for the sources of regularity by which these
forms are circulated we do not find universal laws of nature but international treaties, trade
agreements and protocols of meetings among clock manufacturers, theoretical physicists,
astronomers and communication engineers. Without the efforts and resources continuously
poured into the metrological enterprise, atomic clocks would not be able to tell the same
time for very long.
From a constructivist perspective, the algorithm that produces UTC is a particularly
efficient mechanism for generating consensus among metrologists. Recall that Coordinated
Universal Time is nothing over and above a list of corrections that the BIPM prescribes to
the time signals maintained by local standardization institutes. By administering the
corrections published in the monthly reports of the BIPM, metrologists from different
countries are able to reach agreement despite the fact that their clocks ‘tick’ at different rates.
This agreement is not arbitrary but constrained by the need to balance the central authority
of the International Bureau with the autonomy of national institutes. The need for a trade-
off between centralism and autonomy accounts for the complexity of the algorithm that
produces UTC, which is carefully crafted to achieve a socially optimal compromise among
metrologists. A socially optimal compromise is one that achieves consensus with minimal
cost to local metrological authorities, making it worthwhile for them to comply with the
regulatory strictures imposed by the BIPM. Indeed, the algorithm is designed to distribute
the smallest adjustments possible among as many clocks as possible. Consequently, the
overall adjustments required to approximate UTC at any given local laboratory are kept to a
minimum.
In stressing the importance of ongoing negotiations among metrological institutions,
constructivists do not yet diverge from conventionalists, who similarly view the comparison
and adjustment of standards as prerequisites for the reproducibility of measurement results.
But constructivists go a step further and, unlike conventionalists, refuse to invoke the
presence of an underlying natural regularity in order to explain the stability of timekeeping
standards86. On the contrary, they remind us that regularity is imposed on otherwise
discrepant clocks for the sake of achieving commercial and economic goals. Only after the
fact does this socially-imposed regularity assume the appearance of a natural phenomenon.
Latour expresses this view by saying that “[t]ime is not universal; every day it is made slightly
more so by the extension of an international network [of standards]” (1987, 251). Schaffer
similarly claims that facts only “work” outside of the laboratory because metrologists have
already made the world outside of the laboratory “fit for science” (1992, 23). According to
these statements, if they are taken literally, quantitative scientific claims attain universal
validity not by virtue of any preexisting state of the world, but by virtue of the continued
efforts of metrologists who transform parts of the world until they reproduce desired
quantitative relations87. In what follows I will call this the reification thesis.
The reification thesis is a claim about the sources of regularity exhibited by
measurement outcomes outside of the carefully controlled conditions of a scientific
laboratory. This sort of regularity, constructivists hold, is constituted by the stabilizing
practices carried out by metrologists rather than simply discovered in the course of carrying
out such practices. In other words, metrologists do not simply detect those instruments and
methods that issue reproducible outcomes; rather, they enforce a preconceived order on
otherwise irregular instruments and methods until they issue sufficiently reproducible
86 Ian Hacking identifies explanations of stability as one of three ‘sticking points’ in the debate between social constructivists and their intellectual opponents (1999, 84-92).
87 These claims echo Thomas Kuhn’s in his essay “The Function of Measurement in Modern Physical Science” ([1961] 1977).
outcomes. Note that the reification thesis entails an inversion of explanans and
explanandum relative to the conventionalist account. It is the successful stabilization of
metrological networks that, according to Latour and Schaffer, explains universal regularities
in the operation of instruments rather than the other way around.
How plausible is this explanatory inversion in the case of contemporary timekeeping?
As already hinted at above, the constructivist account fits well with the details of the case
insofar as the short-term stability of standards is involved. In the short run, the UTC
algorithm does not detect frequency stability in the behavior of secondary standards but
imposes stability on their behavior. Whenever a discrepancy arises among different clocks it
is eliminated by ad hoc correction or by replacing some of the clocks with others. The ad hoc
nature of these adjustments guarantees that any instability, no matter how large, can be
eliminated in the short run simply by redistributing instruments and ‘paper forms’
throughout the metrological network.
The constructivist account is nevertheless hard pressed to explain the fact that the
corrections involved in maintaining networks of standards remain small in the long run. An
integral part of what makes a network of metrological standards stable is the fact that its
maintenance requires only small and occasional adjustments rather than large and frequent ones.
A network that reverted to irregularity too quickly after its last recalibration would demand
constant tweaking, making its maintenance ineffective. This long-term aspect of stability is
an essential part of what constitutes a successful network of standards, and is therefore in
need of explanation no less than its short-term counterpart. After all, nothing guarantees
that metrologists will always succeed in diminishing the magnitude and frequency of
corrections they apply to networks of instruments. How should one explain their success,
then, in those cases where they do succeed? Recall that the conventionalist appealed to
underlying regularities in nature to explain long-term stability: metrologists succeed in
stabilizing networks because they choose naturally stable instruments. But this explanatory
move is blocked for those who, like Latour and Schaffer, hold to the reification thesis with
its requirement of explanatory inversion.
To illustrate this point, imagine that metrologists decided to keep the same algorithm
they currently use for calculating UTC, but implemented it on the human heart as a standard
clock instead of the atomic standard88. As different hearts beat at different rates depending
on the particular person and circumstances, the time difference between these organic
standards would grow rapidly from the time of their latest correction. Institutionally
imposed adjustments would only be able to bring universal time into agreement for a short
while before discrepancies among different heart-clocks exploded once more. The same
algorithm that produces UTC would be able to minimize adjustments to a few hours per
month at best, instead of a few nanoseconds when implemented with atomic standards. In
the long run, then, the same mechanism of social compromise would generate either a highly
stable, or a highly unstable, network depending on nothing but the kind of physical process
used as a standard. Constructivists who work under the assumption of the reification thesis
cannot appeal to natural regularities in the behavior of hearts or cesium atoms as primitive
explanans, and would therefore be unable to explain the difference in stability.
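The contrast in the thought experiment above can be put in rough numbers. The sketch below is deliberately crude, and its rate dispersions are hypothetical placeholders (fractional agreement of order 10^-14 for cesium-like clocks, tens of percent for heart rates), chosen only to show how the same comparison yields wildly different stabilities depending on the underlying physical process.

```python
def max_divergence(rates, duration):
    """Worst-case accumulated time difference (in seconds) between any two
    clocks ticking at constant fractional rates over a given duration."""
    elapsed = [r * duration for r in rates]
    return max(elapsed) - min(elapsed)

month = 30 * 86400  # seconds in a month

# Hypothetical rate dispersions, for illustration only.
cesium_like = [1.0, 1.0 + 1e-14, 1.0 - 2e-14]  # parts in 10^14
heart_like = [1.0, 1.2, 0.8]                   # tens of percent

# Over a month, the cesium-like ensemble drifts apart by roughly tens of
# nanoseconds, while the heart-like ensemble drifts apart by days.
```

Whatever corrections an algorithm distributes, the residual divergence it must absorb differs here by many orders of magnitude, and that difference is fixed by the processes themselves, not by the bookkeeping.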
Constructivists may respond by claiming that, for contingent historical reasons,
metrologists have not (yet) mastered reliable control over human hearts as they have over
cesium atoms. This is a historical fact about humans, not about hearts or cesium atoms.
However, even if this claim is granted, it offers no explanation for the difference in long-
88 A similar imaginary exercise is proposed by Carnap ([1966] 1995), pp. 80-84.
term stability but only admits the lack of such an explanation. Another possibility is for
constructivists to relax the reification thesis, and claim that metrologists do detect
preexisting regularities in the behavior of their instruments, but that such regularities do not
sufficiently explain how networks of standards are stabilized. Under this ‘moderate’ reification
thesis, constructivists admit that a combination of natural and socio-technological explanans
is required for the stability of metrological networks. The question then arises as to how the
two sorts of explanans should be combined into a single explanatory account. The following
section will provide such an account.
3.4. Models and coordination
3.4.1. A third alternative
As we have seen, conventionalists and constructivists agree that claims concerning
frequency stability are neither true nor false independent of human agency, but disagree
about the scope and limits of this agency. Conventionalists believe that human agency is
limited to an a priori freedom to define standards of uniformity. For example, the statement:
‘under specified conditions, the cesium transition frequency is constant’ is a definition of
frequency constancy. Once a choice of definition is made, stabilization is a matter of
discovering which clocks agree more closely with the chosen definition and improving those
clocks that do not agree closely enough. Hence the claim: ‘during period T1…T2, clock X
ticked at a constant frequency relative to the current definition of uniformity’ is understood
as an empirical claim whose truth or falsity cannot be modified by metrologists.
Constructivists argue instead that judgments about frequency stability cannot be
abstracted away from the concrete context in which they are made. Claims to frequency
stability are true or false only relative to a particular act of comparison among clocks, made
at a particular time and location in an ever changing network of instruments, protocols and
calculations. As evidenced in detail above, the metrological network of timekeeping
standards is continually rebalanced in light of considerations that have little or nothing to do
with the theoretical definition of uniformity. Quite apart from any ideal definition, de facto
notions of uniformity are multiple and in flux, being constantly modified through the actions
of standardization institutions. If claims to frequency stability appear universal and context-
free, it is only because they rely on metrological networks that have already been successfully
stabilized and ‘black-boxed’ so as to conceal their historicity.
In an attempt to reconcile the two views, one may be tempted to simply juxtapose
their explanans. One would adopt a conventionalist viewpoint to explain the long-term
stability of networks of standards and a constructivist viewpoint to explain short-term
stability. But such juxtaposition would be incoherent, because the two viewpoints make
contradictory claims. As already mentioned, constructivists like Latour and Schaffer reject
the very idea of pre-existing natural regularity, an idea that lies at the heart of
conventionalist explanations of stability. Any attempt to use elements of both views
without reconciling their fundamental tension can only provide an illusion of explanation.
The philosophical challenge, then, is to clarify exactly how constructivism can be ‘naturalized’
and conventionalism ‘socialized’ in a manner that explains both long- and short-term
stability. Meeting this challenge requires developing a subtler notion of natural regularity
than either view offers.
The model-based account of standardization that I will now propose does exactly that.
It borrows elements from both conventionalism and constructivism while modifying their
assumptions about the sources of regularity in both nature and society. As I will argue, this
account successfully explains both the long- and short-term stability of metrological
networks without involving contradictory suppositions.
The model-based account may be summarized by the following four claims:
(i) The proper way to apply a theoretical concept (e.g. the concept of uniformity
of time) depends not only on its definition but also on the way concrete
instruments are modeled in terms of that concept both theoretically and statistically;
(ii) Metrologists are to some extent free to influence the proper mode of
application of the concepts they standardize, not only through acts of
definition, but also by adjusting networks of instruments and by modifying
their models of these instruments;
(iii) Metrologists exercise this freedom by continually shifting the proper mode of
application of the concepts they standardize so as to maximize the stability of
their networks of standards;
(iv) In the process of maximizing stability, metrologists discover and exploit
empirical regularities in the behavior of their instruments.
In what follows I shall argue for each of these four claims and illustrate them in the
special case of contemporary timekeeping. In so doing I will show that the model-based
approach does a better job than the previous two alternatives at explaining the stability of
metrological standards.
3.4.2. Mediation, legislation, and models
The central goal of standardizing a theoretical concept, according to the model-based
approach, is to regulate the application of the concept to concrete particulars. A
standardization project is successful when the application of the concept is universally
consistent and independent of factors that are deemed local or irrelevant. In conventionalist
jargon, standardization projects ‘coordinate’ a theoretical concept to exemplary particulars.
But in the model-based approach such coordination is not exhausted by arbitrary acts of
definition. If coordination amounted to a kind of stipulative act as Reichenbach believed, the
correct way to apply theoretical concepts to concrete particulars would be completely
determinate once this stipulation is given. This is clearly not the case. Consider the
application of the concept of terrestrial time to a concrete cesium clock: the former is a
highly abstract concept, namely the timescale defined by the ‘ticks’ of a perfectly accurate
cesium clock on the ideal surface of the rotating geoid; the latter is a machine exhibiting a
myriad of imperfections relative to the theoretical ideal. How is one to apply the notion of
terrestrial time to the concrete clock, namely, decide how closely the concrete clock ‘ticks’
relative to the ideal terrestrial timescale? The definition of terrestrial time offers a useful
starting point, but on its own is far too abstract to specify a method for evaluating the
accuracy of any clock. Considerable detail concerning the design and environment of the
concrete clock must be added to the definition before the abstract concept can be
determinately applied to evaluate the accuracy of that clock89.
This adding of detail amounts, in effect, to the construction of a hierarchy of models of
concrete clocks at differing levels of abstraction. At the highest level of this hierarchy we find
the theoretical model of an unperturbed cesium atom on the geoid. As mentioned, this
model defines the notion of terrestrial time, the theoretical timescale that is realized by
Coordinated Universal Time.
At the very bottom of this hierarchy lie the most detailed and specific models
metrologists construct of their apparatus. These models typically represent the various
systematic effects and statistical fluctuations influencing a particular ensemble of atomic
clocks housed in one standardization laboratory. These models are used for the calculation
of local approximations to UTC.
Mediating between these levels is a third model, perhaps more aptly termed a cluster of
theoretical and statistical models, grounding the calculation of UTC itself. The models in this
cluster are abstract and idealized representations of various aspects of the clocks that
contribute to UTC and their environments. Among these models, for example, are several
statistical models of noise (e.g. white noise, flicker noise and Brownian noise) as well as
simplified representations of the properties of individual clocks (weights, ‘filters’) and
properties of the ensemble as a whole (‘cushion’ terms, ‘memory’ terms). Values of the
parameter called ‘Coordinated Universal Time’ are determined by analyzing clock data from
the past month in light of the assumptions of models in this cluster.
89 As I have shown in Chapter 1, the accuracy of measurement standards can only be evaluated once the definition of the concept being standardized is sufficiently de-idealized.
It is to this parameter, ‘Coordinated Universal Time’, that the concept of terrestrial
time is directly coordinated, rather than to any concrete clock90. Like Reichenbach, I am
using the term ‘coordination’ here to denote an act that specifies the mode of application of
an abstract theoretical concept. But the form that coordination takes in the model-based
approach is quite different than what classical conventionalists have envisioned. Instead of
directly linking concepts with objects (or operations), coordination consists in the
specification of a hierarchy among parameters in different models. In our case, the hierarchy
links a parameter (terrestrial time) in a highly abstract and simplified theoretical model of the
earth’s spacetime to a parameter (UTC) in a less abstract, theoretical-statistical cluster of
models of certain atomic clocks. UTC is in turn coordinated to a myriad of parameters
(UTC(k)) representing local approximations of UTC by even more detailed, lower-level
models.
Finally, the particular clocks that standardize terrestrial time are subsumed under the
lowest-level models in the hierarchy. I am using the term ‘subsumed under’ rather than
‘described by’ because the accuracy of a concrete clock is evaluated against the relevant low-
level model and not the other way around. This is an inversion of the usual way of thinking
about approximation relations. In most types of scientific inquiry abstract models are meant
to approximate their concrete target systems. But the models constructed during
standardization projects have a special normative function, that of legislating the mode of
application of concepts to concrete particulars. Indeed, standardization is precisely the
legislation of a proper mode of application for a concept through the specification of a
90 More exactly, the concept of terrestrial time is directly coordinated to TAI, i.e. to UTC prior to the addition of ‘leap seconds’ (see the discussion on ‘leap second’ in Section 3.2.5.)
hierarchy of models. At each level of abstraction, the models specify what counts as an
accurate application of the standardized concept at the level below.
Figure 3.1: A simplified hierarchy of approximations among model parameters in contemporary timekeeping. Vertical position on the diagram denotes level of abstraction and arrows denote
approximation relations. Note that concrete levels approximate abstract ones.
Consequently, the chain of approximations (or ‘realizations’) runs upwards in the
hierarchy rather than downwards: concrete clocks approximate local estimates of UTC,
which in turn approximate UTC as calculated by the International Bureau, which in turn
approximates the ideal timescale known as terrestrial time. Figure 3.1 summarizes the
various levels of abstraction and relations of approximation involved in contemporary
atomic timekeeping.
The inversion of approximation relations explains why metrologists deal with
discrepancies in the short run by adjusting clocks rather than by modifying the algorithm
that calculates UTC. If UTC were an experimental best-fit to clock indications, the practice
of correcting and excluding clocks would be suspected of question-begging. However, the goal
of the calculation is not to approximate clock readings, but to legislate the way in which
those readings should be corrected relative to the concept being standardized, namely
uniform time on the geoid (i.e. terrestrial time). The next subsection will clarify why
metrologists are free to perform such legislation.
3.4.3. Coordinative freedom
Equipped with a more nuanced account of coordination than that offered by
conventionalists, we can now proceed to examine how metrological practices influence the
mode of application of concepts. Conventionalists, recall, took the freedom involved in
coordination to be a priori, in principle and singular. According to the model-based account,
metrologists who standardize concepts enjoy a different sort of freedom, one that is
empirically constrained and practically exercised in an ongoing manner. Specifically,
metrologists are to some extent free to decide not only how they define an ideal
measurement of the quantity they are standardizing, but also what counts as an accurate concrete
approximation (or ‘realization’) of this ideal.
The freedom to choose what counts as an accurate approximation of a theoretical ideal
is special to metrology. It stems from the fact that, in the context of a standardization
project, the distribution of errors among different realizations of the quantity being
standardized is not completely determinate. Until metrologists standardize a quantity-
concept, its mode of application remains partially vague, i.e. some ambiguity surrounds the
proper way of evaluating errors associated with measurements of that quantity. Indeed, in
the absence of such ambiguity standardization projects would be not only unnecessary but
impossible. Nevertheless, ambiguity of this sort cannot be dissolved simply by making more
measurements, as a determinate standard for judging what would count as a measurement
error is the very thing metrologists are trying to establish. This problem of indeterminacy is
illustrated most clearly in the case of systematic error91.
The inherent ambiguity surrounding the distribution of errors in the context of
standardization projects leaves metrologists with some freedom to decide how to distribute
errors among multiple realizations of the same quantity. Consequently, metrologists enjoy
some freedom in deciding how to construct models that specify what counts as an ideal
measurement of the quantity they are standardizing in some local context. Concrete
instruments are then subsumed under these idealized models, and errors are evaluated
relative to the chosen ideal.
Metrologists make use of this freedom to fit the mode of application of the concept to
the goals of the particular standardization project at hand. In some cases such goals may be
‘properly’ cognitive, e.g. the reduction of uncertainty, a goal which dominates choices of
primary frequency realizations. But in general there is no restriction on the sort of goals that
may inform choices of realization, and they may include economic, technological and
political considerations.
91 For a detailed argument to this effect see Chapter 2 of this thesis, “Systematic Error and the Problem of Quantity Individuation.”
The freedom to represent and distribute errors in accordance with local and pragmatic
goals explains why metrologists allow themselves to introduce seemingly self-fulfilling
mechanisms to stabilize UTC. Rather than ask: ‘how well does this clock approximate
terrestrial time?’ metrologists are, to a limited extent, free to ask: ‘which models should we
use to apply the concept of terrestrial time to this clock?’ In answering the second question
metrologists enjoy some interpretive leeway, which they use to maximize the short-term
stability of their clock ensemble. This is precisely the role of the algorithmic mechanisms
discussed above. These self-stabilizing mechanisms do not require justification for their
ability to approximate terrestrial time because they are legislative with respect to the
application of the concept of terrestrial time to begin with. UTC is successfully stabilized in
the short run not because its calculation correctly applies the concept of terrestrial time to
secondary standards; rather, UTC is chosen to determine what counts as a correct
application of the concept of terrestrial time to secondary standards because this choice
results in greater short-term stability. Contrary to conventionalist explanations of stability,
then, the short-term stability of UTC cannot be fully explained by the presence of an
independently detectable regularity in the data from individual clocks. Instead, a complete
explanation must appeal irreducibly to stabilizing policies adopted by metrological
institutions. These policies are designed in part to promote a socially optimal compromise
among those institutions.
Coordination is nonetheless not arbitrary. The sort of freedom metrologists exercise in
standardizing quantity concepts is quite different than the sort of freedom typically
associated with arbitrary definition. As the recurring qualification ‘to some extent’ in the
discussion above hints, the freedom exercised by metrologists in practice is severely, though
not completely, constrained by empirical considerations. First, the quantity concepts being
standardized are not ‘free-floating’ concepts but are already embedded in a web of
assumptions. Terrestrial time, for example, is a notion that is already deeply saturated with
assumptions from general relativity, atomic theory, electromagnetic theory and quantum
mechanics. The task of standardizing terrestrial time in a consistent manner is therefore
constrained by the need to maintain compatibility with established standards for other
quantities that feature in these theories. Second, terrestrial time may be approximated in
more than one way. The question ‘how well does clock X approximate terrestrial time?’ is
therefore still largely an empirical question even in the context of a standardization project. It
can be answered to a good degree of accuracy by comparing the outcomes of clock X with
other approximations of terrestrial time. Such approximations rely on post-processed data
from primary cesium standards or on astronomical time measurements derived from the
observation of pulsars. But these approximations of terrestrial time do not completely agree
with one another. More generally, different applications of the same concept to different
domains, or in light of a different trade-off between goals, often end up being somewhat
discrepant in their results. Standardization institutes continually manage a delicate balance
between the extent of legislative freedom they allow themselves in applying concepts and the
inevitable gaps discovered among multiple applications of the same concept. Nothing
exemplifies better the shifting attitudes of the BIPM towards this trade-off than the history
of ‘steering’ corrections, which have been dispensed aggressively or smoothly over the past
decades depending on whether accuracy or stability was preferred.
The gaps discovered between different applications of the same quantity-concept are
among the most important (though by no means the only) pieces of empirical knowledge
amassed by standardization projects. Such gaps constitute empirical discoveries concerning the
existence or absence of regularities in the behavior of instruments, and not merely about the
way metrologists use their concepts. This is a crucial point, as failing to appreciate it risks
mistaking standardization projects for exercises in the social regulation of data-analysis
practices. Even if metrologists reached perfect consensus as to how they apply a given
quantity concept, there is no guarantee that the application they have chosen will lead to
consistent results. Success and failure in applying a quantity concept consistently are to be
investigated empirically, and the discovery of gaps (or their absence) is accordingly a matter
of obtaining genuine empirical knowledge about regularities in nature.
The discovery of gaps explains the possibility of stabilizing networks of standards in
the long run. Metrologists choose to use as standards those instruments to which they have
managed to apply the relevant concept most consistently, i.e. with the smallest gaps. To
return to the example above, metrologists have succeeded in applying the concept of
temporal uniformity to different cesium atoms with much smaller gaps than to different
heart rates. This is not only a fact about the way metrologists apply the concept of
uniformity, but also about a natural regularity in the behavior of cesium atoms, a regularity
that is discovered when cesium clocks are subsumed under the concept of uniformity
through the mediation of relevant models. Metrologists rely on such regularities for their
choices of physical standards, i.e. they tend to select those instruments whose behavior
requires the smallest and least frequent ad hoc corrections. Moreover, as standardization
projects progress, metrologists often find new theoretical and statistical means of predicting
some of the gaps that remain, thereby discovering ever ‘tighter’ regularities in the behavior
of their instruments.
The notion of empirical regularity employed by the model-based account differs from
the empiricist one adopted by classical conventionalists. Conventionalists equated regularity
with a repeatable relation among observations. Carnap, for example, identified regularity in
the behavior of pendulums with the constancy of the ratio between the number of swings
they produce ([1966] 1995, 82). This naive empiricist notion of regularity pertains to the
indications of instruments. By contrast, my notion of regularity pertains to measurement
outcomes, i.e. to estimates that have already been corrected in light of theoretical and statistical
assumptions92. The behavior of measuring instruments is deemed regular relative to some set
of modeling assumptions insofar as their outcomes are predictable under those assumptions.
Prior to the specification of modeling assumptions there can be no talk of regularities,
because such assumptions are necessary for forming expectations about which configuration
of indications would count as regular. Hence modeling assumptions are strongly constitutive
of empirical regularities in my sense of the term. At the same time, regularities are still
empirical, as their existence depends on which indications instruments actually produce.
Empirical regularities, in other words, are co-produced by observations as well as the
assumptions with which a scientific community interprets those observations.
This Kantian-flavored, dual-source conception of regularity explains the possibility of
legislating to nature the conditions under which time intervals are deemed equal. Recall that
acts of legislation determine not only how concepts are applied, but also which
configurations of observations count as regular. For example, which clocks ‘tick’ closer to
the natural frequency of the cesium transition depends on which rules metrologists choose
to follow in applying the concept of natural uniformity93. This is not meant to deny that
there may be mind-independent facts about the frequency stability of clocks, but merely to
92 My analysis of the notion of empirical regularity is therefore similar to my analysis of the notion of agreement discussed in Chapter 2.
93 Kant would have disagreed with this last statement, as he took time to be a universal form of intuition and the synthesis of temporal relations to be governed by universal schemata regardless of one’s theoretical suppositions. The inspiration I draw from Kant does not imply a wholesale adoption of his philosophy.
acknowledge that such mind-independent facts, if they exist, play no role in grounding
knowledge claims about frequency stability. Indeed, the standardization of terrestrial time
would be impossible were metrologists required to obtain such facts, which pertain to ideal
and experimentally inaccessible conditions. From the point of view of the model-based
account, by contrast, there is nothing problematic about this inaccessibility, as the
application of a concept does not require satisfying its theoretical definition verbatim.
Instead, metrologists have a limited but genuine authority to legislate empirical regularities to
their observations, and hence to decide which approximations of the definition are closer
than others, despite not having experimental access to the theoretical ideal.
3.5. Conclusions
This chapter has argued that the stability of the worldwide consensus around
Coordinated Universal Time cannot be fully explained by reduction to either the natural
regularity of atomic clocks or the consensus-building policies enforced by standardization
institutes. Instead, both sorts of explanantia dovetail through an ongoing modeling activity
performed by metrologists. Standardization projects involve an iterative exchange between
‘top-down’ adjustment to the mode of application of concepts and ‘bottom-up’ discovery of
inconsistencies in light of this application94.
94 This double-sided methodological configuration is an example of Hasok Chang’s (2004, 224-8) ‘epistemic iterations.’ It is also reminiscent of Andrew Pickering’s (1995, 22) patterns of ‘resistance and accommodation’, with the important difference that Pickering does not seem to ascribe his ‘resistances’ to underlying natural regularities.
This bidirectional exchange results in greater stability as it allows metrologists to latch
onto underlying regularities in the behavior of their instruments while redistributing errors in
a socially optimal manner. When modeling the behavior of their clocks, metrologists are to
some extent free to decide which behaviors count as naturally regular, a freedom which they use
to maximize the efficiency of a social compromise among standardizing institutions. The
need for effective social compromise is therefore one of the factors that determine the
empirical content of the concept of a uniformly ‘ticking’ clock. On the other hand, the need
for consistent application of this concept is one of the factors that determine which social
compromise is most effective. The model-based account therefore combines the
conventionalist claim that congruity is a description-relative notion with the constructivist
emphases on the local, material and historical contexts of scientific knowledge.
4. Calibration: Modeling the Measurement Process
Abstract: I argue that calibration is a special sort of modeling activity, namely the activity of constructing, testing and deriving predictions from theoretical and statistical models of a measurement process. Measurement uncertainty is accordingly a special sort of predictive uncertainty, namely the uncertainty involved in predicting the outcomes of a measurement process based on such models. I clarify how calibration establishes the accuracy of measurement outcomes and the role played by measurement standards in this procedure. Contrary to currently held views, I show that establishing a correlation between instrument indications and standard quantity values is neither necessary nor sufficient for successful calibration.
4.1. Introduction
A central part of measuring is evaluating accuracy. A measurement outcome that is not
accompanied by an estimate of accuracy is uninformative and hence useless. Even when a
value range or standard uncertainty is not explicitly reported with a measurement outcome,
a rough accuracy estimate is implied by the practice of recording only ‘meaningful digits’. And
yet the requirement to evaluate accuracy gives rise to an epistemological conundrum, which I
have called ‘the problem of accuracy’ in the introduction to this thesis. The problem arises
because the exact values of most physical quantities are unknowable. Quantities such as
length, duration and temperature, insofar as they are represented by non-integer (e.g. rational
or real) numbers, are impossible to measure with certainty. The accuracy of measurements
of such quantities cannot, therefore, be evaluated by reference to exact values but only by
comparing uncertain estimates to each other. When comparing two uncertain estimates of
the same quantity it is impossible to tell exactly how much of the difference between them is
due to the inaccuracy of either estimate. Multiple ways of distributing errors between the two
estimates are consistent with the data. The problem of accuracy, then, is an
underdetermination problem: the available evidence is insufficient for grounding claims
about the accuracy of any measurement outcome in isolation, independently of the
accuracies of other measurements95.
One attempt to solve this problem which I have already discussed is to adopt a
conventionalist approach to accuracy. Mach ([1896] 1966) and later Carnap ([1966] 1995)
and Ellis (1966) thought that the problem of accuracy could be solved by arbitrarily selecting
a measuring procedure as a standard. The accuracies of other measuring procedures are then
evaluated against the standard, which is considered completely accurate. The disadvantages
of the conventionalist approach to accuracy have already been explored at length in the
previous chapters. As I have shown, measurement standards are necessarily inaccurate to
some extent, because the definitions of the quantities they standardize necessarily involve
95 The problem of accuracy can be formulated in other ways, i.e. as a regress or circularity problem rather than an underdetermination problem. In the regress formulation, the accuracy of a set of estimates is established by appealing to the accuracy of yet another estimate, etc. In the circularity formulation, the accuracy of one estimate is established by appealing to the accuracy of a second estimate, whose accuracy is in turn established by appeal to the accuracy of the first. All of these formulations point to the same underlying problem, namely the insufficiency of comparisons among uncertain estimates for determining accuracy. I prefer the underdetermination formulation because it makes it easiest to see why auxiliary assumptions about the measuring process can help solve the problem.
some idealization96. Moreover, the inaccuracies associated with measurement standards are
themselves evaluated by mutual comparisons among standards, a fact that further
accentuates the problem of accuracy.
In Chapter 1 I provided a solution to the problem of accuracy in the special case of
primary measurement standards. I showed that a robustness test performed among the
uncertainties ascribed to multiple standards provides sufficient grounds for making accuracy
claims about those standards. The task of the current chapter is to generalize this solution to
any measuring procedure, and to explain how the methods actually employed in physical
metrology accomplish this solution. Specifically, my aim will be to clarify how the various
activities that fall under the title ‘calibration’ support claims to measurement accuracy.
At first glance this task may appear simple. It is commonly thought that calibration is
the activity of establishing a correlation between the indications of a measuring instrument
and a standard. Marcel Boumans, for example, states that “A measuring instrument is
validated if it has been shown to yield numerical values that correspond to those of some
numerical assignments under certain standard conditions. This is also called calibration
[…].” (2007, 236). I have already shown that there is good reason to think that primary
measurement standards are accurate up to their stated uncertainties. Is it not obvious that
calibration, which establishes a correlation with standard values, thereby also establishes the
accuracy of measuring instruments?
96 Even when the definition of a unit refers to a concrete object such as the Prototype Kilogram, the specification of a standard measuring procedure still involves implicit idealizations, such as the possibility of creating perfect copies of the Prototype and the possibility of constructing perfect balances to compare the mass of the Prototype to those of other objects.
This seemingly straightforward way of thinking about calibration neglects a more
fundamental epistemological challenge, namely the challenge of clarifying the importance of
standards for calibration in the first place. Given that the procedures called ‘standards’ are to
some extent inaccurate, and given that some measuring procedures are more accurate than
the current standard (as shown in Chapter 1), why should one calibrate instruments against
metrological standards rather than against any other sufficiently accurate measuring
procedure?
In what follows I will show that establishing a correlation between instrument
indications and standard values is neither necessary nor sufficient in general for successful
calibration. The ultimate goal of calibration is not to establish a correlation with a standard,
but to accurately predict the outcomes of a measuring procedure. Comparison to a standard is but one
method for generating such predictions, a method that is not always required and is often
inaccurate by itself. Indeed, only in the simplest and most inaccurate case of calibration
(‘black-box’ calibration) is predictability achieved simply by establishing empirical
correlations between instrument indications and standard values. A common source of
misconceptions about calibration is that this simplest form of calibration is mistakenly
thought to be representative of the general case. The opposite is true: ‘black-box’ calibration
is but a special case of a much more complex way of representing measuring instruments
that involves detailed theoretical and statistical considerations.
As I will argue, calibration is a special sort of modeling activity, one in which the
system being modeled is a measurement process. I propose to view calibration as a modeling
activity in the full-blown sense of the term ‘modeling’, i.e. constructing an abstract and
idealized representation of a system from theoretical and statistical assumptions and using
this representation to explain and predict that system’s behaviour97.
I will begin by surveying the products of calibration as explicated in the metrological
literature (Section 4.2) and distinguish between two calibration methodologies that,
following Boumans (2006), I call ‘black-box’ and ‘white-box’ calibration (Sections 4.3 and
4.4). I will show that white-box calibration is the more general of the two, and that it is
aimed at predicting measurement outcomes rather than mapping indications to standard
values. Section 4.5 will then discuss the role of metrological standards in calibration and
clarify the conditions under which their use contributes to the accurate prediction of
measurement outcomes. Finally, Section 4.6 will explain how the accuracy of measurement
outcomes is evaluated on the basis of the model-based predictions produced during
calibration.
4.2. The products of calibration
4.2.1. Metrological definition
The International Vocabulary of Metrology (VIM) defines calibration in the following way:
Calibration: operation that, under specified conditions, in a first step, establishes a relation between the quantity values with measurement uncertainties provided by measurement standards and corresponding indications with associated measurement uncertainties and, in a second step, uses this information to establish a relation for obtaining a measurement result from an indication. (JCGM 2008, 2.39)
97 See also Mari 2005.
This definition is functional, that is, it characterizes calibration through its products.
Two products are mentioned in the definition, one intermediary and one final. The final
product of calibration operations is “a relation for obtaining a measurement result from an
indication”, whereas the intermediary product is “a relation between the quantity values […]
provided by measurement standards and corresponding indications”. Calibration therefore
produces knowledge about certain relations. My aim in this section will be to explicate these
relations and their relata. The following three sections will then provide a methodological
characterization of calibration, namely, a description of several common strategies by which
metrologists establish these relations. In each case I will show that the final product of
calibration – a relation for obtaining a measurement result from an indication – is established
by making model-based predictions about the measurement process. This methodological
characterization will in turn set the stage for the epistemological analysis of calibration in the
last section.
4.2.2. Indications vs. outcomes
The first step in elucidating the products of calibration is to distinguish between
measurement outcomes (or ‘results’) and instrument indications, a distinction previously
discussed in Chapter 2. To recapitulate, an indication is a property of the measuring
instrument in its final state after the measurement process is complete. Examples of
indications are the numerals appearing on the display of a digital clock, the position of an
ammeter pointer relative to a dial, and the pattern of diffraction produced in x-ray
crystallography. Note that the term ‘indication’ in the context of the current discussion
carries no normative connotation. It does not presuppose reliability or success in indicating
anything, but only an intention to use such outputs for reliable indication of some property of
the sample being measured. Note also that indications are not numbers: they may be
symbols, visual patterns, acoustic signals, relative spatial or temporal positions, or any other
sort of instrument output. However, indications are often represented by mapping them
onto numbers, e.g. the number of ‘ticks’ the clock generated in a given period, the
displacement of the pointer relative to the ammeter dial, or the spatial density of diffraction
fringes. These numbers, which may be called ‘processed indications’, are convenient
representations of indications in mathematical form98. A processed indication is not yet an
estimate of any physical quantity of the sample being measured, but only a mathematical
description of a state of the measuring apparatus.
A measurement outcome, by contrast, is an estimate of a quantity value associated with
the object being measured, an estimate that is inferred from one or more indications.
Outcomes are expressed in terms of a particular unit on a particular scale and include, either
implicitly or explicitly, an estimate of uncertainty. Respective examples of measurement
outcomes are an estimate of duration in seconds, an estimate of electric current in Ampere,
and an estimate of distance between crystal layers in nanometers. Very often measurement
outcomes are recorded in the form of a mean value and a standard deviation that represents
the uncertainty around the mean, but other forms are commonly used, e.g. min-max value
range.
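The step from a series of processed indications to a measurement outcome of the mean-and-standard-deviation form can be sketched as follows. This is a minimal illustration, not part of the thesis: the readings are invented, and the choice of the standard deviation of the mean as the uncertainty measure is one convention among those the text mentions.

```python
import statistics

def outcome_from_indications(processed_indications, unit="s"):
    """Summarize repeated processed indications as a measurement outcome:
    a best estimate plus an uncertainty around it.

    Illustrative convention: the mean as the estimate, the standard
    deviation of the mean as the uncertainty."""
    n = len(processed_indications)
    mean = statistics.fmean(processed_indications)
    # Standard deviation of the mean (a statistical assumption, not a given)
    u = statistics.stdev(processed_indications) / n ** 0.5
    return mean, u, unit

# e.g. five repeated readings of a duration, in seconds
est, unc, unit = outcome_from_indications([1.02, 0.98, 1.01, 0.99, 1.00])
print(f"{est:.3f} \u00b1 {unc:.3f} {unit}")
```

Note that the bare list of readings is only a mathematical description of instrument states; it becomes an outcome only once summarized and attributed, with an uncertainty, to the measured object.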
98 The difference between numbers and numerals is important here. Before processing, an indication is never a number, though it may be a numeral (i.e. a symbol representing a number).
To attain the status of a measurement outcome, an estimate must be abstracted away
from its concrete method of production and pertain to some quantity objectively, namely, be
attributable to the measured object rather than the idiosyncrasies of the measuring instrument,
environment and human operators. Consider the ammeter: the outcome of measuring with
an ammeter is an estimate of the electric current running through the input wire. The
position of the ammeter pointer relative to the dial is a property of the ammeter rather than
the wire, and is therefore not a candidate for a measurement outcome. This is the case
whether or not the position of the pointer is represented on a numerical scale. It is only once
theoretical and statistical background assumptions are made and tested about the behaviour
of the ammeter and its relationship with the wire (and other elements in its environment)
that one can infer estimates of electric current from the position of the pointer. The ultimate
aim of calibration is to validate such inferences and characterize their uncertainty.
Processed indications are easily confused with measurement outcomes partly because
many instruments are intentionally designed to conceal their difference. Direct-reading
instruments, e.g. household mercury thermometers, are designed so that the numeral that
appears on their display already represents the best estimate of the quantity of interest on a
familiar scale. The complex inferences involved in arriving at a measurement outcome from
an indication are ‘black-boxed’ into such instruments, making it unnecessary for users to
infer the outcome themselves99. Regardless of whether or not users are aware of them, such
inferences form an essential part of measuring. They link claims such as ‘the pointer is
99 Somewhat confusingly, the process of ‘black-boxing’ is itself sometimes called ‘calibration’. For example, the setting of the null indication of a household scale to the zero mark is sometimes referred to as ‘calibration’. From a metrological viewpoint, this terminological confusion is to be avoided: “Adjustment of a measuring system should not be confused with calibration, which is a prerequisite for adjustment” (JCGM 2008, 3.11, Note 2). Calibration operations establish a relation between indications and outcomes, and this relation may later be expressed in a simpler manner by adjusting the display of the instrument.
between the 0.40 and 0.41 marks on the dial’ to claims like ‘the current in the wire is
0.405±0.005 Ampere’. If such inferences are to be deemed reliable, they must be grounded
in tested assumptions about the behaviour of the instrument and its interactions with the
sample and the environment.
4.2.3. Forward and backward calibration functions
The distinction between indications and outcomes allows us to clarify the two
products of calibration mentioned in the definition above. The intermediary product, recall,
is “a relation between the quantity values with measurement uncertainties provided by
measurement standards and corresponding indications with associated measurement
uncertainties”. This relation may be expressed in the form of a function, which I will call the
‘forward calibration function’:
<indication> = fFC ( <quantity value>, <additional parameter values>) (4.1)
The forward calibration function maps values of the quantity to be measured – e.g. the
current in the wire – to instrument indications, e.g. the position of the ammeter pointer100.
100 The term ‘calibration function’ (also ‘calibration curve’, see JCGM 2008, 4.31) is commonly used in metrological literature, whereas the designations ‘forward’ and ‘backward’ are my own. I call this a ‘forward’ function because its input values are normally understood as already having a determinate value prior to measurement and as determining its output value through a causal process. Nevertheless, my account of calibration does not presuppose this classical picture of measurement, and is compatible with the possibility that the quantity being measured does not have a determinate value prior to its measurement.
The forward calibration function may include input variables representing additional
quantities that may influence the indication of the instrument – for example, the intensity of
background magnetic fields in the vicinity of the ammeter. The goal of the first step of
calibration is to arrive at a forward calibration function and characterize the uncertainties
associated with its outputs, i.e. the instrument’s indications. This involves making theoretical
and statistical assumptions about the measurement process and empirically testing the
consequences of these assumptions, as we shall see below.
The second and final step of calibration is aimed at establishing “a relation for
obtaining a measurement result from an indication.” This relation may again be expressed in
the form of a function, which may be called the ‘backward calibration function’ or simply
‘calibration function’:
<quantity value> = fC ( <indication>, <additional parameter values>) (4.2)
A calibration function maps instrument indications to values of the quantity being
measured, i.e. to measurement outcomes. Like the forward function, the calibration function
may include additional input variables whose values affect the relation between indications
and outcomes. In the simplest (‘black-box’) calibration procedures additional input
parameters are neglected, and a calibration function is obtained by simply inverting the
forward function. Other (‘white-box’) calibration procedures represent the measurement
process in more detail, and the derivation of the calibration function becomes more
complex. Once a calibration function is established, metrologists use it to associate values of
the quantity being measured with indications of the instrument.
So far I have discussed the products of calibration without explaining how they are
produced. My methodological analysis of calibration will proceed in three stages, starting
with the simplest method of calibration and gradually increasing in complexity. The cases of
calibration I will consider are:
1. Black-box calibration against a standard whose uncertainty is negligible
2. White-box calibration against a standard whose uncertainty is negligible
3. White-box calibration against a standard whose uncertainty is non-negligible (‘two-way white-box’ calibration)
In each case I will show that the products of calibration are obtained by constructing
models of the measurement process, testing the consequences of these models and deriving
predictions from them. Viewing calibration as a modeling activity will in turn provide the key
to understanding how calibration establishes the accuracy of measurement outcomes.
4.3. Black-box calibration
In the most rudimentary case of calibration, the measuring instrument is treated as a
‘black-box’, i.e. as a simple input-output unit. The inner workings of the instrument and the
various ways it interacts with the sample, environment and human operators are either
neglected or drastically simplified. Establishing a calibration function is then a matter of
establishing a correlation between the instrument’s indications and corresponding quantity
values associated with a measurement standard.
For example, a simple caliper may be represented as a ‘black-box’ that converts the
diameter of an object placed between its legs to a numerical reading. The caliper is calibrated
by concatenating gauge blocks – metallic bars of known length – between the legs of the
caliper. We can start by assuming, for the time being, that the uncertainties associated with
the length of these standard blocks are negligible relative to those associated with the
outcomes of the caliper measurement. Calibration then amounts to a behavioural test of the
instrument under variations to the standard sample. The indications of the caliper are
recorded for different known lengths and a curve is fitted to the data points based on
background assumptions about how the caliper is expected to behave. The resulting forward
calibration function is of the form:
I0 = fFC (O) (4.3)
This function maps the lengths (O) associated with a combination of gauge blocks to
the indications of the caliper (I0). Notice that despite the simplicity of this operation, some
basic theoretical and statistical assumptions are involved. First, the shape chosen for fFC
depends on assumptions about the way the caliper converts lengths to indications. Second,
the use of gauge blocks implicitly assumes that length is additive under concatenation
operations. These assumptions are theoretical, i.e. they suppose that length enters into
certain nomic relations with other quantities or qualities. Third, associating uncertainties with
the indications of the caliper requires making one or more statistical assumptions, for
example, that the distribution of residual errors is normal. All of these assumptions are
idealizations: the response of any real caliper is not exactly linear, the concatenation of
imperfect rods is not exactly additive, and the distribution of errors is never exactly normal.
The first step of calibration is meant to test how well these idealizations, when taken
together, approximate the actual behaviour of the caliper. If they fit the data closely enough
for the needs of the application at hand, these idealized assumptions are then presumed to
continue to hold beyond the calibration stage, when the caliper is used to estimate the
diameter of non-standard objects. Under a purely behavioural, black-box representation of
the caliper, its calibration function is obtained by simple inversion of the forward function:
O = fC (I0) = fFC⁻¹ (I0) (4.4)
The calibration function expresses a hypothetical nomic relation between indications
and outcomes, a relation that is derived from a rudimentary theoretical-statistical model of
the instrument. This function may now be used to generate predictions concerning the
outcomes of caliper measurements. Whenever the caliper produces indication i the diameter
of the object between the caliper’s legs is predicted to be o = fC (i). Under this simple
representation of the measurement process, the uncertainty associated with measurement
outcomes arises wholly from uncontrolled variations to the indications of the instrument.
These variations are usually represented mathematically by applying statistical measures of
variation (such as standard deviation) to a series of observed indications. This projection is
based on an inductive argument and its precision therefore depends on the number of
indications observed during the first step of calibration.
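The two steps just described can be sketched in code. The sketch below follows the idealizations named in the text – a linear forward function and normally distributed residuals – but the gauge-block lengths, the caliper indications, and the particular uncertainty-propagation rule are my own illustrative assumptions, not data from the thesis.

```python
import numpy as np

# Step 1: record caliper indications for gauge blocks of known length
# (invented data; lengths O in mm, indications I0 in scale units)
standard_lengths = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # O
indications = np.array([10.2, 20.1, 30.4, 40.2, 50.5])        # I0

# Fit a linear forward calibration function I0 = fFC(O) = a*O + b,
# an idealizing assumption about how the caliper converts lengths
a, b = np.polyfit(standard_lengths, indications, deg=1)

# Uncertainty of indications: spread of residuals around the fit,
# assuming (idealizing again) normally distributed errors
residuals = indications - (a * standard_lengths + b)
u_indication = residuals.std(ddof=2)   # two fitted parameters

# Step 2: under a black-box representation, the calibration function
# is the simple inverse of the forward function: O = fC(I0) = (I0 - b)/a
def calibration_function(i0):
    return (i0 - b) / a

# Predict the diameter of a non-standard object from an indication
o = calibration_function(25.3)
u_outcome = u_indication / abs(a)      # propagated through the inverse
print(f"diameter \u2248 {o:.2f} \u00b1 {u_outcome:.2f} mm")
```

The inductive character of the projection is visible here: the uncertainty estimate rests entirely on the five calibration points, and says nothing about conditions that differ from those under which they were recorded.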
Black-box calibration is useful when the behaviour of the device is already well-
understood and when the required accuracy is not too high. Because the calibration function
takes only one argument, namely the instrument indication (I0), the resulting quantity-value
estimate (O) is insensitive to other parameters that may influence the behaviour of the
instrument. Such parameters may have to do with interactions among parts of the instrument,
the sample being measured, and the environment. They may also have to do with the
operation and reading of the instrument by humans, and with the way indications are
recorded and processed.
The neglect of these additional factors limits the ability to tell whether, and under what
conditions, a black-box calibration function can be expected to yield reliable predictions. As
long as the operating conditions of the instrument are sufficiently similar to calibration
conditions, one can expect the uncertainties associated with its calibration function to be
good estimates of the uncertainty of measurement outcomes. However, black-box
calibration represents the instrument too crudely to specify which conditions count as
‘sufficiently similar’. As a result, measurement outcomes generated through black-box
calibration are exposed to systematic errors that arise when measurement conditions change.
4.4. White-box calibration
4.4.1. Model construction
White-box calibration procedures represent the measurement process as a collection of
modules. This differs from the black-box approach to calibration, which treats the
measurement process as a single input/output unit. Each module is characterized by one or
more state parameters, laws of temporal evolution, and laws of interaction with other
modules. The collection of modules and laws constitutes a more detailed (but still idealized)
model of the measurement process than a black-box model.
Typically, a white-box model of a measuring process involves assumptions concerning:
(i) components of the measuring instrument and their mutual interactions; (ii) the measured
sample, including its preparation and interaction with the instrument; (iii) elements in the
environment (‘background effects’) and their interactions with both sample and instrument;
(iv) variability among human operators; and (v) data recording and processing procedures.
Each of these five aspects may be represented by one or more modules, though not every
aspect is represented in every case of white-box calibration.
A white-box representation of a simple caliper measurement is found in Schwenke et
al. (2000, 396). Figure 4.1 illustrates the modules and parameters involved. The measuring
instrument is represented by the component modules ‘leg’ and ‘scale’; the sample and its
interaction with the instrument by the modules ‘workpiece’ and ‘contact’, and the data by the
module ‘readout’. The environment is represented only indirectly by its influence on the
temperatures of the workpiece and scale, and variability among human operators is
completely neglected. Of course, one can easily imagine more or less detailed breakdowns of
a caliper into modules than the one offered here. The term ‘white-box’ should be
understood as referring to a wide variety of modular representations of the measurement
process with differing degrees of complexity, rather than a unique mode of representation101.
101 Simple modular representations are sometimes referred to as ‘grey-box’ models. See Boumans (2006, 121-2).
Figure 4.1: Modules and parameters involved in a white-box calibration of a simple caliper (Source: Schwenke et al. 2000)
The multiplicity of modules in white-box representations means that additional
parameters are included in the forward and backward calibration functions, parameters that
mediate the relation between outcomes and indications. In the caliper example, these
parameters include the temperatures and thermal expansion coefficients of the workpiece
and scale, the roughness of contact between the workpiece and caliper legs, the Abbe-error
(‘wiggle room’) of the legs relative to each other, and the resolution of the readout. These
parameters are assumed to enter into various dependencies with each other as well as with
the quantity being measured and the indications of the instrument. Such dependencies are
specified in light of background theories and tested through secondary experiments on the
apparatus.
Engineers who design, construct and test precision measuring instruments typically
express these dependencies in mathematical form, i.e. as equations. Such equations represent
the laws of evolution and interaction among different modules in a manner that is amenable
to algebraic manipulation. The forward and backward calibration functions are then
obtained by solving this set of equations and arriving at a general dependency relation
among model parameters102. The general form of a white-box forward calibration function
is:
I0 = fFC (O , I1 , I2 , I3 , … In) (4.5)
where I0 is the model’s prediction concerning the processed indication of the
instrument, O the quantity being measured, and I1,… In additional parameters. As before, O is
obtained by reference to a measurement standard whose associated uncertainties may for the
time being be neglected. The additional parameter values in the forward function are
estimated by performing additional measurements on the instrument, sample and
environment, e.g. by measuring the temperatures of the caliper and the workpiece, the
roughness of the contact etc.
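These relations can be illustrated schematically. The following sketch implements a toy forward function in the spirit of equation (4.5) for the caliper case; the linear thermal-expansion model and all parameter names and numerical values are hypothetical simplifications, not Schwenke et al.'s actual equations.

```python
# A toy forward calibration function in the spirit of eq. (4.5).
# The expansion model and all values are hypothetical simplifications.

T_REF = 20.0  # reference temperature (degrees Celsius)

def forward_calibration(length_mm, alpha_scale, alpha_piece,
                        temp_scale, temp_piece, resolution_mm):
    """Predict the processed indication I0 from the measured quantity O
    (length_mm) and additional parameters I1...In."""
    # The workpiece expands linearly relative to its length at T_REF.
    expanded = length_mm * (1 + alpha_piece * (temp_piece - T_REF))
    # The scale's own expansion makes a warm instrument read short.
    indicated = expanded / (1 + alpha_scale * (temp_scale - T_REF))
    # Quantize to the readout resolution.
    return round(indicated / resolution_mm) * resolution_mm

# At reference conditions the indication reproduces the standard value:
print(forward_calibration(100.0, 11.5e-6, 23e-6, 20.0, 20.0, 0.01))  # 100.0
```

The point of the sketch is only that the predicted indication depends on the additional parameters, each of which must itself be estimated by a secondary measurement.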
4.4.2. Uncertainty estimation
In the first step of white-box calibration, the forward function is derived from model
equations and tested against the actual behaviour of the instrument. Much like the black-box
case, testing involves recording the indications produced by the instrument in response to a
set of standard samples, and comparing these indications with the indications I0 predicted by
the forward function. But the analysis of residual errors is more complex in the white-box
case, because the instrument is represented in a more detailed way. On the one hand,
102 For the set of equations representing the caliper measurement process and a derivation of its forward calibration function see Schwenke et al. (2000, 396), eq. (3) and (4).
deviations between actual and predicted indications may be treated as uncontrolled (so-called
‘random’) variations in the measurement process. As in the black-box case, such
deviations are accounted for by modeling the residual errors statistically and arriving at a
measure of their probability distribution. Uncertainties evaluated in this way are labelled
‘type-A’ in the metrological literature103. On the other hand, observed indications may also
deviate from predicted indications because these predictions are based on erroneous
estimates of additional parameters I1,… In. A common source of uncertainty when predicting
indications is the fact that additional parameters I1,… In are estimated by performing
secondary measurements, and these measurements suffer from uncertainties of their own.
The effects of these ‘type-B’ uncertainties are evaluated by propagating them through the
model’s equations to the predicted indication I0. This alternative way of evaluating
uncertainty is not available under a black-box representation of the instrument because such
representation neglects the influence of additional parameters. In white-box calibration, by
contrast, both type-A and type-B methods are available and can be used in combination to
explain the total deviation between observed and predicted indications.
An example of the propagation of type-B uncertainties has already been discussed in
Chapter 1, namely the method of uncertainty budgeting. Individual uncertainty contributions are
evaluated separately and then summed up in quadrature (that is, as a root sum of squares). A
crucial assumption of this method is that the uncertainty contributions are independent of
each other.
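The quadrature rule itself is simple to state. A minimal sketch, assuming the contributions are independent and already expressed in the same (relative) units:

```python
import math

def combined_uncertainty(contributions):
    """Combine independent uncertainty contributions in quadrature,
    i.e. as a root sum of squares, as in an uncertainty budget."""
    return math.sqrt(sum(u ** 2 for u in contributions))

# Made-up budget entries (e.g. in parts in 10^6):
print(combined_uncertainty([3.0, 4.0, 12.0]))  # 13.0
```

Note that the combined figure is dominated by the largest contribution, which is why budgets are often used to identify which module of the model most needs improvement.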
Table 4.1 is an example of an uncertainty budget drawn for a measurement of the
Newtonian gravitational constant G with a torsion pendulum (Luo et al. 2009). In a
103 For details see JCGM (2008a).
contemporary variation on the 1798 Cavendish experiment, the pendulum is suspended in a
vacuum between two masses, and G is measured by determining the difference in torque
exerted on the pendulum at different mass-pendulum alignments. The white-box
representation of the apparatus is composed of several modules (pendulum, masses, fibre
etc.) and sub-modules, each associated with one or more quantities whose estimation
contributes uncertainty to the measured value of G. The last item in the budget is the
‘statistical’, namely type-A, uncertainty arising from uncontrolled variations. The total
uncertainty associated with the measurement is then calculated as the quadratic sum of
individual uncertainty contributions.
Table 4.1: Uncertainty budget for a torsion pendulum measurement of G, the Newtonian gravitational constant. Values are expressed in units of 10^-6. A diagram of the apparatus appears on the right.
(Source: Luo et al. 2009, 3)
The method of uncertainty budgeting is computationally simple. As long as the
uncertainties of different input quantities are assumed to be independent of each other, their
propagation to the measurement outcome can be calculated analytically. A more
computationally challenging case occurs when model parameters depend on each other in
nonlinear ways, thereby making it difficult or impossible to propagate the uncertainties
analytically. In such cases uncertainty estimates can sometimes be derived through computer
simulation. This is the case when metrologists attempt to calibrate coordinate measuring
machines (CMMs), i.e. instruments that measure the shape and texture of three-dimensional
objects by recording a series of coordinates along their surface. These instruments are
calibrated by constructing idealized models that represent aspects of the instrument
(amplifier linearity, probe tip radius), the sample (roughness, thermal expansion), the
environment (frame vibration) and the data acquisition mechanism (sampling algorithm).
Each such input parameter has a probability density function associated with it. The model
along with the probability density functions then serve to construct a Monte Carlo
simulation that samples the input distributions and propagates the uncertainty to
measurement outcomes (Schwenke et al. 2000, Trenk et al. 2004). This sort of computer-simulated calibration has become so prevalent that in 2008 the International Bureau of
Weights and Measures (BIPM) published a ninety-page supplement to its “Guide to the
Expression of Uncertainty in Measurement” dealing solely with Monte Carlo methods
(JCGM 2008b)104.
104 The topic of uncertainty propagation is, of course, much more complex than the discussion here is able to cover. Apart from the methods of uncertainty budgeting and Monte Carlo, several other methods of uncertainty propagation are commonly applied to physical measurement, including the Taylor method (Taylor 1997), probability bounds analysis, and Bayesian analysis (Draper 1995).
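The basic logic of Monte Carlo propagation can be sketched briefly. Assuming, for simplicity alone, that every input parameter has a Gaussian probability density function (real applications use whatever distributions the model specifies), a toy implementation might look as follows:

```python
import random
import statistics

def monte_carlo_uncertainty(model, input_dists, n=100_000, seed=0):
    """Propagate input uncertainties through a (possibly nonlinear)
    model by sampling each input's distribution. Here every input is
    taken to be Gaussian, given as a (mean, standard deviation) pair."""
    rng = random.Random(seed)
    outcomes = [model(*(rng.gauss(mu, sigma) for mu, sigma in input_dists))
                for _ in range(n)]
    return statistics.mean(outcomes), statistics.stdev(outcomes)

# Hypothetical nonlinear model: an area computed from two uncertain lengths.
mean, std = monte_carlo_uncertainty(lambda a, b: a * b,
                                    [(10.0, 0.1), (5.0, 0.05)])
print(round(mean, 2), round(std, 2))  # close to 50.0 and 0.71
```

Because the model is applied to sampled values rather than differentiated analytically, the method handles nonlinear dependencies among parameters that defeat the quadrature rule.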
During uncertainty evaluation, the forward calibration function is iteratively tested for
compatibility with observed indications. Deviations from the predicted indications that fail
to be accounted for by either type-A or type-B methods are usually a sign that the white-box
model is misrepresenting the measurement process. Sources of potential misrepresentation
include, for example, a neglected or insufficiently controlled background effect, an
inadequate statistical model of the variability in indications, an error in measuring an
additional parameter, or an overly simplified representation of the interaction between
certain modules. Much like other cases of scientific modeling and experimenting, white-box
calibration involves iterative modifications to the model of the apparatus as well as to the
apparatus itself in an attempt to account for remaining deviations. The stage at which this
iterative process is deemed complete depends on the degree of measurement accuracy
required and on the ability to physically control the values of additional parameters.
4.4.3. Projection
Once sufficiently improved to account for deviations, the white-box model is
projected beyond the circumstances of calibration onto the circumstances that are presumed
to obtain during measurement. This is the second step of calibration, which involves the
derivation of a backward function from model equations. The general form of a white-box
backward calibration function is:
O = fC (I0 , I1 , I2 , I3 , … In) (4.6)
In general, a white-box calibration function cannot be obtained by inverting the
forward function, but requires a separate derivation. Nevertheless, the additional parameters
I1…In are often presumed to be constant and equal (within uncertainty) to the values they
had during the first step of calibration. For example, metrologists assume that the caliper will
be used to measure objects whose temperature and roughness are the same (within
uncertainty) as those of the workpieces that were used to calibrate it. This assumption of
constancy has a double role. First, it allows metrologists to easily obtain a calibration
function by inverting the forward function. Second, the assumption of constancy is
epistemically important, as it specifies the scope of projectability of the calibration function. The
function is expected to predict measurement outcomes correctly only when additional
parameters I1…In fall within the value ranges specified. If circumstances differ from this
narrow specification, it is necessary to derive a new calibration function for these new
circumstances prior to obtaining measurement outcomes.
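The role of the constancy assumption can be made vivid with a toy sketch: a backward function that refuses to produce an outcome when an additional parameter falls outside the value range it occupied during calibration. The parameter names, ranges and the linear inversion are all hypothetical.

```python
# Hypothetical calibration scope: the ranges the additional parameters
# occupied during the first step of calibration.
CALIBRATED_RANGES = {"temp_piece": (19.0, 21.0), "roughness_um": (0.0, 1.5)}

def backward_calibration(indication_mm, temp_piece, roughness_um):
    """Infer the outcome O from the indication I0 (cf. eq. 4.6), but
    only within the scope of projectability of the calibration."""
    for name, value in [("temp_piece", temp_piece),
                        ("roughness_um", roughness_um)]:
        low, high = CALIBRATED_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} lies outside the calibration "
                             "scope; a new calibration function is needed")
    # Within scope, invert a (hypothetical) linear thermal correction.
    return indication_mm * (1 + 23e-6 * (temp_piece - 20.0))

print(backward_calibration(100.0, 20.5, 0.8))
```

The explicit range check is the epistemic content of the constancy assumption: outside those ranges the function simply has no warrant.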
This last point sheds light on an important difference between black-box and white-
box calibration: they involve different trade-offs between predictive generality and predictive
accuracy. Black-box calibration models predict the outcomes of measuring procedures under
a wide variety of circumstances, but with relatively low accuracy, as they fail to take into
account local factors that intervene on the relation between indications and outcomes.
White-box calibration operations, on the other hand, specify such local factors narrowly, but
their predictions are projectable only within that narrow scope. Of course, a continuum lies
between these two extremes. A white-box calibration function can be made more general by
widening the specified value range of its additional parameters or by considering fewer such
additional parameters. In doing so, however, one generally increases the uncertainty
associated with the function’s predictions.
4.4.4. Predictability, not just correlation
Another epistemically important difference between black-box and white-box
calibration is the role played by measurement standards in each case. In black-box
calibration, one attempts to obtain a stable correlation between the processed indications of
the measuring instrument and standard values of the quantity to be measured. By ‘stable’ I
mean repeatable over many runs; by ‘correlation’ I mean a mapping between two variables
that is unique (i.e. bijective) up to some stated uncertainty. For example, the black-box
calibration of a caliper is considered successful if a stable correlation is obtained between its
readout and the number of 1-millimetre standard blocks concatenated between caliper legs.
This may lead one to hastily conclude that obtaining such correlations is necessary and
sufficient for successful calibration. But this last claim does not generalize to white-box
calibration, where one attempts to obtain a stable correlation between the processed
indications of the measuring instrument and the predictions of an idealized model of the
measurement process105. A correlation of the first sort does not imply a correlation of the
second sort.
105 A different way of phrasing this claim would be to say that the idealized model itself functions as a measurement standard, though this way of talking deviates from the way metrologists usually use the term ‘measurement standard’.
To see the point, recall that during the first step of white-box calibration one accounts
for deviations between observed indications and the indications predicted by the forward
calibration function. The forward function is derived from equations specified by an
idealized model of the measurement process. The total uncertainty associated with these
deviations is accordingly a measure of the predictability of the behaviour of the measurement
process by the model. Recall further that the indications I0 predicted by a white-box model
depend not only on standard quantity values O but also on a host of additional parameters
I1…In , as well as on laws of evolution and interaction among modules. Consequently, a mere
correlation between observed indications and standard quantity values is insufficient for
successful white-box calibration. To be deemed predictable under a given white-box model,
indications should also exhibit the expected dependencies on the values of additional
parameters.
As an example, consider the caliper once more. If the standard gauge blocks are
gradually heated as they are concatenated, theory predicts that the indications of the caliper
will deviate from linear dependence on the total length of the blocks due to the uneven
expansion rates of the blocks and the caliper. Now suppose that this nonlinearity fails to be
detected empirically – that is, the caliper’s indications do not display the sensitivity to
temperature predicted by its white-box model but instead remain linearly correlated to the
total length of the gauge blocks. It is tempting to conclude from this that the caliper is more
accurate than previously thought. This would be a mistake, however, for accuracy is a
property of an inference from indications to outcomes, and this inference has proved
inaccurate in our case. Instead, the right conclusion from such an empirical finding in the
context of white-box calibration is that the model of the caliper is in need of correction. It
may be that the dependency of indications on temperature has a different coefficient than
presumed, or that a hidden background effect cancels out the effects of thermal expansion,
etc. Unless an adequate correction is made to the model of the caliper, the uncertainty
associated with its predictions – and hence with the outcomes of caliper measurements –
remains high despite the linear correlation between indications and standard quantity values.
It is the overall predictive uncertainty of the model, rather than the correlation of
indications with standard values, that determines the uncertainty of measurement outcomes.
We already saw that in the second step of calibration model assumptions are projected
beyond the calibration phase and used to predict measurement outcomes. The total
uncertainty associated with measurement outcomes then expresses the likelihood that the
measured quantity value will fall in a given range when the indications of the instrument are
such-and-such. In other words, measurement uncertainty is a measure of the predictability of
measurement outcomes under an idealized model of the measurement process, rather than a
measure of closeness of correlation between the observed behaviours of the instrument and
values supplied by standards.
This conclusion may be generalized to black-box calibration. Black-box calibration is,
after all, a special case of white-box calibration where additional parameters are neglected.
All sources of uncertainty are represented as uncontrolled deviations from the expected
correlation between indications and standard values, and evaluated through type-A
(‘statistical’) methods. A black-box model, in other words, is a coarse-grained representation
of the measuring process under which measurement uncertainty and closeness of correlation
with a standard happen to coincide. Nevertheless, in both black- and white-box cases theoretical
and statistical considerations enter into one’s choice of model assumptions, and in both
cases total measurement uncertainty is a measure of the predictability of the outcome under
those assumptions. Black-box calibration is simply one way to ground such predictions, by
making data-driven empirical generalizations about the behaviour of an instrument. Such
generalizations suffer from higher uncertainties and a fuzzier scope than the predictions of
white-box models, but have the same underlying inferential structure.
The emphasis on predictability distinguishes the model-based account from narrower
conceptions of calibration that view it as a kind of reproducibility test. Allan Franklin, for
example, defines calibration as “the use of a surrogate signal to standardize an instrument.”
(1997, 31). Though he admits that calibration sometimes involves complex inferences, in his
view the ultimate goal of such inferences is to ascertain the ability of the apparatus to
reproduce known results associated with standard samples (‘surrogate signals’). A similar
view of calibration is expressed by Woodward (1989, 416-8). These restrictive views treat
calibration as an experimental investigation of the measuring apparatus itself, rather than an
investigation of the empirical consequences of modelling the apparatus under certain
assumptions. Hence Franklin seems to claim that, at least in simple cases, the success or
failure of calibration procedures is evident through observation. The calibration of a
spectrometer, for example, is understood by Franklin as a test for the reproducibility of
known spectral lines as seen on the equipment’s readout (Franklin 1997, 34). Such views fail
to recognize that even in the simplest cases of calibration one still needs to make idealized
assumptions about the measurement process. Indeed, unless the instrument is already
represented under such assumptions reproducibility tests are useless, as there are no grounds
for telling whether a similarity of indications should be taken as evidence for a similarity in
outcomes, and whether the behaviour of the apparatus can be safely projected beyond the
test stage. Despite this, restrictive views neglect the representational aspect of calibration and
only admit the existence of an inferential dimension to calibration in special and highly
complex cases (ibid, 75).
4.5. The role of standards in calibration
4.5.1. Why standards?
As I have argued so far, the ultimate goal of calibration is to predict the outcomes of a
measuring procedure under a specified set of circumstances. This goal is only partially served
by establishing correlations with standard values, and must be complemented with a detailed
representation of the measurement process whenever high accuracy is required. This line of
argument raises two questions concerning the role of measurement standards. First, how
does the use of standards contribute to the accurate prediction of measurement outcomes?
Second, is establishing a correlation between instrument indications and standard values
necessary for successful calibration?
The simple answer to the first question is that standards supply reference values of the
quantity to be measured. That is, they supply values of the variable O that are plugged into
the forward calibration function, thereby allowing predictions concerning the instrument’s
indications to be tested empirically. But this answer is not very informative by itself, for it
does not explain why one ought to treat the values supplied by standards as accurate. In the
previous sections we simply assumed that standards provide accurate values, despite the fact
that (as already shown in Chapter 1) even the most accurate measurement standards have
nonzero uncertainties. Given that the procedures metrologists call ‘standards’ are not
absolutely accurate, is there any reason to use them for estimating values of O rather than
any other procedure that measures the same quantity?
As I am about to show, the answer depends on whether the question is understood in
a local or global context. Locally – for any given instance of calibration – it makes no
epistemic difference whether one calibrates against a metrological standard or against some
other measuring procedure, provided that the uncertainty associated with its outcomes is
sufficiently low. By contrast, from a global perspective – when the web of inter-procedural
comparisons is considered as a whole – the inclusion of metrological standards is crucial, as
it ensures that the procedures being compared are measuring the quantity they are intended
to.
4.5.2. Two-way white-box calibration
Let us begin with the local context, and consider any particular pair of procedures – call
them the ‘calibrated’ and ‘reference’ procedures. During calibration, the reference procedure
is used to measure values of O associated with certain samples; these values are plugged into
the forward function of the calibrated instrument and used to predict its indications; and
these predictions are then compared to the actual indications produced by the calibrated
instrument in response to the same (or similar) samples. For the sake of carrying out this
procedure, it makes no difference whether the reference procedure is a metrologically
sanctioned standard or not, because the accuracy of metrologically sanctioned standards is
evaluated in exactly the same way as the accuracy of any other measurement procedure. The
uncertainties associated with standard measuring procedures are evaluated by constructing
white-box models of those standards, deriving a forward calibration function, propagating
uncertainties through the model, and testing model predictions for compatibility with other
standards.
Table 4.2: Type-B uncertainty budget for NIST-F1, the US primary frequency standard. The clock is deemed highly accurate despite the large discrepancy between its indications and the cesium clock
frequency, because the correction factor is accurately predictable. (Source: Jefferts et al. 2007, 766)
This has already been shown in Chapter 1. For example, the cesium fountain clock
NIST-F1, which serves as the primary frequency standard in the US, has a fractional
frequency uncertainty of less than 5 parts in 10^16 (Jefferts et al. 2007). This uncertainty is
evaluated by modelling the clock theoretically and statistically, drawing an uncertainty budget
for the clock, and testing these uncertainty estimates for compatibility with other cesium
fountain clocks106. Table 4.2 is a recent uncertainty budget for NIST-F1 (including only
type-B evaluations). Note that the systematic corrections applied to the clock’s indications
(the total frequency ‘bias’) far exceed the total type-B uncertainty associated with the
106 By ‘modeling the clock statistically’ I mean making statistical assumptions about the variation of its indications over time. These assumptions are used to construct models of noise, such as white noise, flicker noise and Brownian noise (see also section 3.4.2.)
outcome. In other words, the clock ‘ticks’ considerably faster – by hundreds of standard
deviations – than the cesium frequency it is supposed to measure. The clock is nevertheless
deemed highly accurate, because the cesium frequency is predictable from clock indications
(‘ticks’) with a very low uncertainty.
Now consider a scenario in which a cesium fountain clock is used to calibrate a
hydrogen spectrometer, i.e. a device measuring the frequency associated with subatomic
transitions in hydrogen. Such calibration is described in Niering et al (2000). The accuracy
expected of the spectrometer is close to that of the standard, so that one cannot neglect the
inaccuracies associated with the standard during calibration. In this case metrologists must
consider two white-box models – one for the calibrated instrument and one for the standard
– and compare the measurement outcomes predicted by each model. These predicted
measurement outcomes already incorporate bias corrections and are associated with
estimates of total uncertainty propagated through each model. The calibration is then
considered successful if and only if the outcomes of the two instruments, as predicted by their
respective models, coincide within their respective uncertainties:
fC (I0 , I1 , I2 , I3 ,… In) ≈ fC’ (I0’ , I1’ , I2’ , I3’ ,… Im’) (4.7)
where fC is the calibration function associated with the hydrogen spectrometer, fC’ is the
calibration function associated with the cesium standard, and ≈ stands for ‘compatible up to
stated uncertainty’. Notice the complete symmetry between the calibrated instrument and
the standard as far as the formal requirement for successful calibration goes. This symmetry
expresses the fact that a calibrated procedure and a reference procedure differ only in their
degree of uncertainty and not in the way uncertainty is evaluated.
This ‘two-way white-box’ procedure exemplifies calibration in its full generality, where
both the calibrated instrument and the reference are represented in a detailed manner. As
equation (4.7) makes clear, calibration is successful when it establishes a predictable
correlation between the outcomes of measuring procedures under their respective models,
rather than between their observed indications. Such correlation amounts to an empirical
confirmation that the predictions of different calibration functions are mutually compatible.
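The compatibility criterion of equation (4.7) can be sketched as a simple numerical test. The coverage factor k = 2 is a common metrological convention, and the example values are purely illustrative:

```python
import math

def compatible(outcome_a, u_a, outcome_b, u_b, k=2.0):
    """Test whether two model-predicted outcomes coincide within their
    combined stated uncertainties, using coverage factor k."""
    return abs(outcome_a - outcome_b) <= k * math.hypot(u_a, u_b)

# Purely illustrative outcomes and uncertainties:
print(compatible(10.003, 0.002, 10.001, 0.002))  # True
```

The test is symmetrical in its two arguments, mirroring the symmetry between calibrated instrument and standard noted above.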
4.5.3. Calibration without metrological standards
The importance of reference procedures in calibration, then, lies in the fact that they are
modelled more accurately – namely, with lower predictive uncertainties – than the
procedures they are used to calibrate. It is now easy to see that a reference procedure does
not have to be a metrological standard, but instead may be any measurement procedure
whose uncertainties are sufficiently low to make the comparison informative. We already
saw an example of successful calibration without any reference to a metrological standard in
Chapter 1, where two optical clocks were calibrated against each other. In that case both
clocks had significantly lower measurement uncertainties than the most accurate
metrologically sanctioned frequency standard. More mundane examples of calibration
without metrological standards are found in areas where an institutional consensus has not
yet formed around the proper application of the measured concept. Paper quality is an
example of a complex vector of quantities (including fibre length, width, shape and
bendability) for which an international reference standard does not yet exist (Wirandi and
Lauber 2006). The instruments that measure these quantities are calibrated against each
other in a ‘round-robin’ ring, without a central reference standard (see Figure 4.2). This
procedure provides sufficient assurance that the outcomes of measurements taken in one
laboratory reliably predict the outcomes obtained for a similar sample at any other
participating laboratory.
Figure 4.2: A simplified diagram of a round-robin calibration scheme. Authorized laboratories calibrate their measuring instruments against each other’s, without reference to a central standard.
(source: Wirandi and Lauber 2006, 616)
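A round-robin ring can be sketched as a pairwise compatibility test in which no laboratory plays the role of a central standard. The laboratory names, values and uncertainties below are invented for illustration:

```python
import math
from itertools import combinations

def round_robin_check(outcomes, k=2.0):
    """Compare every pair of laboratories in a round-robin ring and
    return the pairs whose outcomes are mutually incompatible.
    `outcomes` maps a lab name to a (value, uncertainty) pair."""
    failures = []
    for (lab_a, (v_a, u_a)), (lab_b, (v_b, u_b)) in combinations(
            outcomes.items(), 2):
        if abs(v_a - v_b) > k * math.hypot(u_a, u_b):
            failures.append((lab_a, lab_b))
    return failures

# Invented fibre-length results (value, uncertainty) from three labs:
labs = {"A": (2.31, 0.02), "B": (2.33, 0.02), "C": (2.45, 0.02)}
print(round_robin_check(labs))  # C disagrees with both A and B
```

An outlying laboratory is identified not by its distance from a designated standard but by its incompatibility with the rest of the ring, which is precisely the robustness structure discussed in the next section.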
These examples of ‘symmetrical’ calibration all share a common inferential structure,
namely, they establish the accuracy of measuring procedures through a robustness argument.
The predictions derived from models of multiple measuring procedures are tested for
compatibility with each other, and those that pass the test are taken to be accurate up to
their respective uncertainties. As I will argue in the next section, this inferential structure is
essential to calibration and present even in seemingly asymmetrical cases where the
inaccuracy of standards is neglected.
Before discussing robustness, let me return to the role of metrological standards in
calibration. From a local perspective, as we have already seen, metrological standards are not
qualitatively different from other measuring procedures in their ability to produce reference
values for calibration. Metrological standards are useful only insofar as the uncertainty
associated with their values is low enough for calibration tests to be informative. Of
course, the values provided by metrological standards are usually associated with very low
uncertainties. But it is not by virtue of their role as standards that their uncertainties are
deemed low. The opposite is the case: the ability to model certain procedures in a way that
facilitates accurate predictions of their outcomes motivates metrologists to adopt such
procedures as standards, as already discussed in Chapter 1.
Above I posed the question: is establishing correlation with standard values necessary
for successful calibration? A partial answer can now be given. Insofar as local instances of
calibration are concerned, it is not necessary to appeal to metrologically sanctioned standards
in order to evaluate uncertainty. Establishing correlations among outcomes of different
measuring procedures is sufficient for this goal. One may, of course, call some of the
procedures being compared ‘standards’ insofar as they are modelled more accurately than
others. But this designation does not mark any qualitative epistemic difference between
standard and non-standard procedures.
4.5.4. A global perspective
The above is not meant to deny that choices of metrological standards carry with them
a special normative force. There is still an important difference between calibrating an
instrument against a non-standard reference procedure and calibrating it against a
metrological standard, even when both reference procedures are equally accurate. The
difference is that in the first case, if significant discrepancies are detected between the
outcomes of the two procedures, either one is in principle equally amenable to
correction. All other things being equal, the models from which a calibration function is
derived for either procedure are equally revisable. This is not the case if the reference
procedure is a metrological standard, because a model representing a standard procedure has
a legislative function with respect to the application of the quantity concept in question.
This legislative (or ‘coordinative’) function of metrological models has been discussed
at length in Chapter 3. The theoretical and statistical assumptions with which a metrological
standard is represented serve a dual, descriptive and normative role. On the one hand, they
predict the actual behaviour of the process that serves as a standard, and on the other, they
prescribe how the concept being standardized is to be applied to that process. Metrological
standards can fulfill this legislative function because they are modelled in terms of the
theoretical definition of the relevant concept, that is, they constitute realizations of that
concept. For this reason, in the face of systematic discrepancies between the outcomes of
standard and non-standard procedures, there is a good reason to prefer a correction to the
outcomes of non-standard procedures over a correction to the outcomes of standard
procedures.
Note that this preference does not imply that metrological standards are categorically
more accurate, or accurate for different reasons, than other measuring procedures. The total
uncertainty associated with a measuring procedure is still evaluated in exactly the same way
whether or not that procedure is a metrological standard. But the second-order uncertainty
associated with metrological standards – that is, the uncertainty associated with evaluations
of their uncertainty – is especially low. This is the case because metrological standards are
modelled in terms of the theoretical definition of the quantity they realize, and their
uncertainties are accordingly estimates of the degree to which the realization succeeds in
approximately satisfying the definition. Such uncertainty estimates enjoy a higher degree of
confidence than those associated with non-standard measuring procedures, because the
latter are not directly derived from the theoretical definition of the measured quantity and
cannot be considered equally safe estimates of the degree of its approximate satisfaction. For
this reason, the assumptions under which non-standard measuring procedures are modelled
are usually deemed more amenable to revision than the assumptions informing the modelling
of metrological standards. This is the case even when the non-standard procedure is thought to be
more accurate, i.e. to have lower first-order uncertainty, than the standard. For example, if
an optical atomic clock were to systematically disagree with a cesium standard, the model of
the former would be more amenable to revision despite it supposedly being the more
accurate clock.
The importance of the normative function of metrological standards is revealed from a
global perspective on calibration, when one views the web of inter-procedural comparisons
as a whole. Here metrological standards form the backbone that holds the web together by
providing a stable reference for correcting systematic errors. The consistent distribution of
systematic errors across the web makes possible its subsumption under a single quantity
concept, as explained in Chapter 2. In the absence of a unified policy for distributing errors,
nothing prevents a large web from breaking into ‘islands’ of intra-comparable but mutually
incompatible procedures. By legislating how an abstract quantity concept is to be realized,
models of metrological standards serve as a kind of ‘semantic glue’ that ties together distant
parts of the web.
As an example, consider all the clocks calibrated against Coordinated Universal Time
either directly or indirectly, e.g. through national time signals. What justifies the claim that all
these clocks are measuring, with varying degrees of accuracy, the same quantity – namely,
time on a particular atomic scale? The answer is that all these clocks produce consistent
outcomes when modelled in terms of the relevant quantity, i.e. UTC. But to test whether
they do, one must first determine what counts as an adequate way of applying the concept of
Coordinated Universal Time to any particular clock. This is where metrological standards
come into play: they fix a semantic link between the definition of the quantity being
standardized and each of its multiple realizations. In the case of UTC, this legislation is
performed by modelling a handful of primary frequency standards and several hundred
secondary standards in a manner that minimizes their mutual discrepancies, as described in
Chapter 3. It then becomes possible to represent non-standard empirical procedures such as
quartz clocks in terms of the standardized quantity by correcting their systematic errors
relative to the network’s backbone. In the absence of this ongoing practice of correction, the
web of clocks would quickly devolve into clusters that measure mutually incompatible
timescales.
From a global perspective, then, metrological standards still play an indispensable
epistemic role in calibration whenever (i) the web of instruments is sufficiently large and (ii)
the quantity being measured is defined theoretically. This explains why metrological rigour is
necessary for standardizing quantities that have reached a certain degree of theoretical
maturity. At the same time, the analysis above explains why metrological standards are
unnecessary for successful calibration in the case of ‘nascent’ quantities such as paper quality.
4.6. From predictive uncertainty to measurement accuracy
We saw above (Section 4.4) that measurement uncertainty is a kind of predictive
uncertainty. That is, measurement uncertainty is the uncertainty associated with predictions
of the form: “when the measuring instrument produces indication i the value of the
measured quantity will be o.” Such predictions are derived during calibration from statistical
and theoretical assumptions about the measurement process. Calibration tests proceed by
comparing the outcomes predicted by a model of one measuring procedure (the ‘calibrated’
procedure) to the outcomes predicted by a model of another measuring procedure (the
‘reference’ procedure). When the predicted outcomes agree within their stated uncertainties,
calibration is deemed successful. This success criterion is expressed by equation (4.7).
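The agreement criterion just described can be illustrated with a short sketch. Equation (4.7) itself is stated earlier in the chapter; the code below instead uses the normalized-error test common in metrological comparisons, and should be read as an illustration of the general idea rather than as the thesis's own formula. The function name, the numerical values, and the coverage factor k = 2 are all hypothetical.

```python
import math

def compatible(o_cal, u_cal, o_ref, u_ref, k=2.0):
    """Return True if two predicted outcomes agree within their
    combined expanded uncertainty (coverage factor k).

    This is the normalized-error criterion often used in
    interlaboratory comparisons; equation (4.7) in the thesis
    may differ in detail.
    """
    combined_u = math.sqrt(u_cal**2 + u_ref**2)
    return abs(o_cal - o_ref) <= k * combined_u

# Two procedures predicting the same temperature outcome (values in °C):
print(compatible(100.02, 0.03, 99.97, 0.02, k=2.0))  # discrepancy 0.05 vs. 2*sqrt(0.03^2 + 0.02^2)
```

On this rendering, a successful calibration is simply one for which `compatible` holds for the outcomes predicted by the calibrated and reference procedures across the relevant range of indications.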
At first glance it seems that calibration should only be able to provide estimates of
consistency among predictions of measurement outcomes. And yet metrologists routinely use
calibration tests to estimate the accuracy of outcomes themselves. That is, they infer from
the mutual consistency among predicted outcomes that the outcomes are accurate up to
their stated uncertainties. The question arises: why should estimates of consistency among
outcomes predicted for different measuring procedures be taken as good estimates of the
accuracy of those outcomes?
The general outline of the answer should already be familiar from my discussion of
robustness in Chapter 1. There I showed how robustness tests of the form (RC), performed
among multiple realizations of the same measurement unit, ground claims to the accuracy of
those realizations. I further showed that this conclusion holds regardless of the particular
meaning of ‘accuracy’ employed – be it metaphysical, epistemic, operational, comparative or
pragmatic. The final move, then, is to expand the scope of (RC) to include measuring
procedures in general. The resulting ‘generalized robustness condition’ may be formulated in
the following way:
(GRC) Given multiple, sufficiently diverse processes that are used to measure
the same quantity, the uncertainties ascribed to their outcomes are
adequate if and only if
(i) discrepancies among measurement outcomes fall within their
ascribed uncertainties; and
(ii) the ascribed uncertainties are derived from appropriate models
of each measurement process.
Uncertainties that satisfy (GRC) are reliable measures of the accuracies of
measurement outcomes under all five senses of ‘measurement accuracy’, for the same
reasons that applied to (RC).¹⁰⁷
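Clause (i) of (GRC) lends itself to a simple computational illustration. The sketch below checks whether every pairwise discrepancy among a set of outcomes falls within the combined expanded uncertainties; the function name, the sample data, and the coverage factor are hypothetical, and clause (ii), the adequacy of the underlying models, cannot be checked numerically and is simply assumed here.

```python
import math
from itertools import combinations

def grc_condition_i(outcomes, k=2.0):
    """Check clause (i) of the generalized robustness condition:
    every pairwise discrepancy among measurement outcomes falls
    within the combined expanded ascribed uncertainties.

    `outcomes` is a list of (value, uncertainty) pairs, one per
    measuring procedure.
    """
    return all(
        abs(v1 - v2) <= k * math.sqrt(u1**2 + u2**2)
        for (v1, u1), (v2, u2) in combinations(outcomes, 2)
    )

# Three diverse procedures measuring the same quantity (hypothetical data):
procedures = [(9.81, 0.02), (9.79, 0.03), (9.80, 0.01)]
print(grc_condition_i(procedures))
```

Note that the check is holistic: adding a new procedure to the list tests its ascribed uncertainty against every other member at once, which anticipates the web-of-calibrations picture developed below.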
What remains to be clarified is how calibration operations test the satisfaction of
(GRC). Recall that a calibration is deemed successful – that is, its model is deemed good at
predicting the outcomes of the measuring procedure to within the stated uncertainty – when
the predicted outcomes are shown to be consistent with those associated with a reference procedure. Now
consider an entire web of such successful calibration operations. Each ‘link’ in the web
stands for an instance of pairwise calibration, and is associated with some uncertainty that is
a combination of uncertainties from both calibrated and reference procedures. Assuming
that there are no cumulative systematic biases across the web, the relation of compatibility
¹⁰⁷ See Chapter 1, Section 1.5: “A robustness condition for accuracy”.
within uncertainty ≈ can be assumed to be transitive.¹⁰⁸ Consequently, measurement
uncertainties that are vindicated by one pairwise calibration are traceable throughout the
web. The outcomes of any two measurement procedures in the web are predicted to agree
within their ascribed uncertainties even if they are never directly compared to each other.
The web of calibrations for a given quantity may therefore be considered an indirect
robustness test for the uncertainties associated with each individual measuring procedure.
Each additional calibration that successfully attaches its uncertainty estimates to the web
indirectly tests those estimates for compatibility with many other estimates made for a
variety of other measuring procedures. In other words, each additional calibration
constitutes an indirect test as to whether (GRC) is satisfied when the web of comparisons is
appended with a putative new member.
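Why compatibility within uncertainty is only approximately transitive can be illustrated by propagating uncertainty along a chain of pairwise calibrations. Even with no systematic biases, independent link uncertainties accumulate in quadrature, so distant nodes of the web agree only within a larger combined uncertainty; this is the numerical counterpart of the point that long chains require standards at strategic junctions. The link values below are hypothetical.

```python
import math

def chain_uncertainty(link_uncertainties):
    """Combined standard uncertainty accumulated along a chain of
    pairwise calibrations, assuming independent errors at each link
    (root sum of squares). With cumulative systematic biases the
    growth would be faster and transitivity would fail sooner.
    """
    return math.sqrt(sum(u**2 for u in link_uncertainties))

# A hypothetical clock traced to a primary standard through three links,
# link uncertainties in seconds:
links = [0.5e-9, 1.2e-9, 2.0e-9]
print(chain_uncertainty(links))  # larger than any single link uncertainty
```

The combined value bounds how far two never-directly-compared procedures at opposite ends of the chain may disagree while the web as a whole still counts as consistent.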
This conclusion holds equally well for black-box and one-way white-box calibration, which
are but special cases of the fully general, two-way white-box case. To be sure, in these special
cases some of the complexities involved in deriving and testing model-based predictions
remain implicit. In the one-way white-box case one makes the simplifying assumption that
the behaviour of the standard is perfectly predictable. In the black-box case one additionally
makes the simplifying assumption that changes in extrinsic circumstances will not influence
the relation between indications and outcomes. These varying levels of idealization affect the
accuracy and generality with which measurement outcomes are predicted, but not the
general methodological principle according to which compatibility among predictions is the
ultimate test for measurement accuracy.
¹⁰⁸ Note that this last assumption is adequate only when the web is small (i.e. when the maximal distance among nodes is small) or when metrological standards are included at strategic junctions, as already discussed above.
4.7. Conclusions
This chapter has argued that calibration is a special sort of modelling activity. Viewed
locally, calibration is the complex activity of constructing, testing, deriving predictions from,
and propagating uncertainties through models of a measurement process. Viewed globally,
calibration is a test of robustness for model-based predictions of multiple measuring
processes. This model-based account of calibration solves the problem of accuracy posed in
the introduction to this thesis. As I have shown, uncertainty estimates that pass the
robustness test are reliable estimates of measurement accuracy despite the fact that the
accuracy of any single measuring procedure cannot be evaluated in isolation.
The key to the solution was to show that, from an epistemological point of view,
measurement accuracy is but a special case of predictive accuracy. As far as it is knowable,
the accuracy of a measurement outcome is the accuracy with which that outcome can be
predicted on the basis of a theoretical and statistical model of the measurement process. A
similar conclusion holds for measurement outcomes themselves, which are the results of
predictive inferences from model assumptions mediated through the derivation of a
calibration function. The intimate inferential link between measurement and prediction has
so far been ignored in the philosophical literature, and has potentially important
consequences for the relationship between theory and measurement.
Epilogue
In the introduction to this thesis I outlined three epistemological problems concerning
measurement: the problems of coordination, accuracy and quantity individuation. In each of
the chapters that followed I argued that these problems are solved (or dissolved) by
recognizing the roles models play in measurement. A precondition for measuring is the
coherent subsumption of measurement processes under idealized models. Such
subsumption is a necessary condition for obtaining objective measurement outcomes from
local and idiosyncratic instrument indications. In addition, I have shown that contemporary
methods employed in the standardization of measuring instruments indeed achieve the goal
of coherent subsumption. Hence the model-based account meets both the general and the
practice-based epistemological challenges set forth in the introduction.
A general evidential condition for testing measurement claims has emerged from my
studies, which may be called ‘convergence under representations’. Claims to measurement,
accuracy and quantity individuation are settled by testing whether idealized models
representing different measuring processes converge to each other. This convergence
requirement is two-pronged. First, the assumptions with which models are constructed have
to cohere with each other and with background theory. Second, the consequences of
representing concrete processes under these assumptions must converge in accordance with
their associated uncertainties. When this dual-aspect convergence is shown to be sufficiently
robust under alterations to the instrument, sample and environment, all three problems are
solved simultaneously. That is, a robust convergence among models of multiple instruments
is sufficient to warrant claims about (i) whether the instruments measure the same quantity,
(ii) which quantity the instruments measure and (iii) how accurately each of them measures
this quantity. Of course, such knowledge claims are never warranted with complete certainty.
The ‘sufficiency’ of robustness tests may always be challenged by a new perturbation that
destroys convergence and forces metrologists to revise their models. As a result, some
second-order uncertainty is always present in the characterization of measurement
procedures.
Claims about coordination, accuracy and quantity individuation are contextual, i.e.
pertain to instruments only as they are represented by specified models. This context-
sensitivity is a consequence of recognizing the correct scope of knowledge claims made on
the basis of measurements. As I have shown, measurement outcomes are themselves
contextual and relative to the assumptions with which measurement processes are modelled.
Similarly, the notions of agreement, systematic error and measurement uncertainty all
become clear once their sensitivity to representational context is acknowledged. This,
however, does not mean that measurement outcomes lose their validity outside of the
laboratory where they were produced. On the contrary, the condition of convergence under
representations explains why measurement outcomes are able to ‘travel’ outside of the
context of their production and remain valid across a network of inter-calibrated
instruments. The fact that these instruments converge under their respective models ensures
that measurement outcomes produced by using one instrument would be reproducible
across the network, thereby securing the validity of measurement outcomes throughout the
network’s scope.
The model-based account has potentially important consequences for several ongoing
debates in the philosophy of science, consequences which are beyond the purview of this
thesis. One such consequence, already noted at the end of Chapter 4, is the centrality of
prediction to measurement, a discovery which calls for subtler accounts of the relationship
between theory and measurement. Another important consequence concerns the possibility
of a clear distinction between hypotheses and evidence. As we saw above, measurement
outcomes are inferred by projection from hypotheses about the measurement process. Just
like any other projective estimate, the validity of a measurement outcome depends on the
validity of underlying hypotheses. Hence the question arises whether and why measurement
outcomes are better suited to serve as evidence than other projective estimates, e.g. the
outputs of predictive computer simulations. Finally, the very idea that scientific
representation is a two-place relation – connecting abstract theories or models with concrete
objects and events – is significantly undermined by the model-based account. Under my
analysis, whether or not an idealized model adequately represents a measurement process is a
question whose answer is relative to the representational adequacy of other models with
respect to other measurement processes. Hence the model-based account implies a kind of
representational coherentism, i.e. a diffusion of representational adequacy conditions across
the entire web of instruments and knowledge claims. These implications of the model-based
account must nevertheless await elaboration elsewhere.