
Jan Conrad NuFACT 06 25 August 2006 1

Some comments on ”What (should) sensitivity estimates mean ?”

Jan Conrad

Royal Institute of Technology (KTH)

Stockholm

Jan Conrad NuFACT 06 25 August 2006 2

Outline

- Definitions of “sensitivity”
- Confidence intervals/p-values with systematic uncertainties
  - Averaging
  - Profiling
- An illustration
- Remarks on the ensemble
- Summary/recommendations

The aim of this talk is to confuse as much as necessary, but as little as possible.

Jan Conrad NuFACT 06 25 August 2006 3

Definition of ”sensitivity” - I

1. (well-known HEP-statistics expert from Oxford)

Median upper limit obtained from repeated experiments with no signal; in a two-dimensional problem, keep one parameter fixed.

2. (fairly well-known HEP-statistics expert from Germany)

Mean result of whatever quantity we want to measure, for example 90 % confidence intervals, the mean being taken over identical replicas of the experiment.

3. (less well-known HEP-statistics expert from Italy)

Look at that paper I wrote in arXiv:physics. Nobody has used it, but it is the best definition .....

Jan Conrad NuFACT 06 25 August 2006 4

Definition of sensitivity - II

Definition using p-values (hypothesis test):

The experiment is said to be sensitive to a given value of the parameter Θ13 = Θ13^sens at significance level α if the mean p-value obtained given Θ13^sens is smaller than α.

The p-value is (per definition) calculated given the null hypothesis Θ13 = 0:

p = P(T ≥ T_obs | Θ13 = 0)

where the test statistic T could be, for example, a χ², and T_obs is the actually observed value of the test statistic.

Jan Conrad NuFACT 06 25 August 2006 5

Definition of sensitivity - III (what NuFact people most often use ?)

Definition using confidence intervals (CI) 1)

The experiment is said to be sensitive to a given value of the parameter Θ13 = Θ13^sens at significance level α if the mean 2) 1−α CI obtained, given Θ13^sens, does not contain Θ13 = 0.

1) This means using confidence intervals for hypothesis testing. I think I convinced myself that the approaches are equivalent, but .....

2) Some people prefer the median .... (because the median is invariant under parameter transformations)

Jan Conrad NuFACT 06 25 August 2006 6

So what ?

Once we have decided on the definition of sensitivity, two problems need to be addressed:

- What method should be used to calculate the CI or the p-value ?
- Since the experiment does not exist, what is the ensemble of experiments we use to calculate the mean (or other quantities) ?

Jan Conrad NuFACT 06 25 August 2006 7

P-values and the Neyman Pearson lemma

Uniformly most powerful test statistic (the likelihood ratio, by the Neyman-Pearson lemma):

λ = L(x | H0) / L(x | H1)

To calculate p-values, we need to know the null distribution of T. It therefore comes in handy that, asymptotically,

−2 ln λ ∼ χ²

Remember: p = P(T ≥ T_obs | H0).

Jan Conrad NuFACT 06 25 August 2006 8

Example: practical calculation using p-values

Simulate an observation where Θ13 > 0. Fit a model with Θ13 = 0 and a model with Θ13 > 0; then

δχ² = χ²(Θ13 = 0) − χ²(Θ13 free)

is (under certain circumstances) χ² distributed.

For problems with this approach, see Luc Demortier: "P-Values: What They Are and How to Use Them", draft report presented at the BIRS workshop on statistical inference problems in high energy physics and astronomy, July 15-20, 2006.
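To make the recipe concrete, here is a minimal toy sketch (my illustration, not code from the talk), assuming a simple counting model n ~ Poisson(s + b) with a known background b; it fits with and without the signal and converts δχ² into a p-value via the asymptotic χ² tail with one degree of freedom:

```python
# Minimal sketch of the delta-chi^2 p-value recipe (illustrative assumptions:
# n ~ Poisson(s + b) with the background b known exactly; systematics come later).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import chi2

rng = np.random.default_rng(42)
b_true, s_true = 100.0, 25.0            # hypothetical true background and signal
n = rng.poisson(b_true + s_true)        # one pseudo-observation

def neg2lnL(s, b=b_true):
    mu = b + s
    return 2.0 * (mu - n * np.log(mu))  # -2 ln L up to an n-dependent constant

fit = minimize_scalar(neg2lnL, bounds=(0.0, 10.0 * s_true), method="bounded")
delta_chi2 = max(neg2lnL(0.0) - fit.fun, 0.0)  # chi^2(s = 0) - chi^2(s free)
p_value = chi2.sf(delta_chi2, df=1)            # asymptotic null distribution (Wilks)
print(f"delta_chi2 = {delta_chi2:.2f}, p = {p_value:.4f}")
```

Whether the χ²(1 dof) tail is actually trustworthy for a given model is exactly the coverage question discussed below.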

Jan Conrad NuFACT 06 25 August 2006 9

Some methods for p-value calculation

- Conditioning
- Prior-predictive
- Posterior-predictive
- Plug-in
- Likelihood ratio
- Confidence interval
- Generalized frequentist

I will not talk about these any more.

Jan Conrad NuFACT 06 25 August 2006 10

Some methods for confidence interval calculation (the Banff list)

- Bayesian
- Feldman & Cousins with Bayesian treatment of nuisance parameters
- Profile Likelihood (I will talk a little bit about this one)
- Modified Likelihood
- Feldman & Cousins with Profile Likelihood
- Fully frequentist
- Empirical Bayes

Jan Conrad NuFACT 06 25 August 2006 11

Properties I: Coverage

A method is said to have coverage (1−α) if, in infinitely many repeated experiments, the resulting CIs include (cover) the true value in a fraction (1−α) of all cases (irrespective of what the true value is).

[Figure: empirical coverage (1−α) vs. true s, with the nominal line at 0.9; values above the line are over-covering, values below are under-covering.]
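As a sketch of how one would check coverage by toy Monte Carlo (my illustration; the Gaussian-approximation interval n ± 1.645√n for a Poisson mean is an assumption chosen for brevity, not a method advocated in the talk):

```python
# Toy coverage check: repeat the experiment with a fixed true s, build an
# interval each time, and count how often the interval covers the truth.
import numpy as np

rng = np.random.default_rng(1)
s_true, n_trials, z90 = 50.0, 20000, 1.645  # z for a central 90% interval

n = rng.poisson(s_true, size=n_trials)
lo = n - z90 * np.sqrt(n)                   # Gaussian-approximation interval
hi = n + z90 * np.sqrt(n)
coverage = np.mean((lo <= s_true) & (s_true <= hi))
print(f"empirical coverage = {coverage:.3f} (nominal 0.90)")
```

Repeating this for several true values of s traces out a plot like the one sketched above.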

Jan Conrad NuFACT 06 25 August 2006 12

Properties II:Type I, type II error and power

Type I error: reject H0, though it is true.
Prob(Type I error) = α (corresponds to coverage for hypothesis tests).

Type II error: accept H0, though it is false.
Power: β = 1 − Prob(Type II error). Given H1, what is the probability that we will reject H0 at given significance α ?

Jan Conrad NuFACT 06 25 August 2006 13

Nuisance parameters

Nuisance parameters are parameters which enter the data model but are not of prime interest. Example, background:

n ~ Poisson(s + b)

where the signal s is of prime interest and the background b is a nuisance parameter. You don't want to give CIs (or p-values) that depend on nuisance parameters, so you need a way to get rid of them.

Jan Conrad NuFACT 06 25 August 2006 14

How to treat nuisance parameters ?

There is a wealth of approaches to dealing with nuisance parameters. Two are particularly common:

Averaging (the Bayesian treatment). No time to discuss this, see:
J. Conrad et al., Phys. Rev. D67:012002, 2003
J. Conrad & F. Tegenfeldt, Proceedings PhyStat 05, physics/0511055
F. Tegenfeldt & J. Conrad, Nucl. Instr. Meth. A539:407-413, 2005

Profiling. The example which I will present here: Profile Likelihood/MINUIT (which is similar to what many of you have been doing).

Jan Conrad NuFACT 06 25 August 2006 15

Profile Likelihood Intervals

The profile likelihood ratio for the signal s is

λ(s) = L(s, b̂(s)) / L(ŝ, b̂)

evaluated for the measurements n_meas and b_meas, where b̂(s) is the MLE of b given s, and (ŝ, b̂) is the MLE of s and b given the observations. To extract limits: the lower and upper limit are the values of s where −2 ln λ(s) = 2.706 (90 % CL).
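A sketch of the limit extraction in practice (my illustration; the straw-man model n ~ Poisson(s + b), b_meas ~ Gauss(b, σ_b) and all numbers are assumptions, not the talk's example):

```python
# Profile-likelihood 90% CL interval for an assumed model
# n ~ Poisson(s + b), b_meas ~ Gauss(b, sigma_b).
import numpy as np
from scipy.optimize import minimize_scalar, brentq

n_obs, b_meas, sigma_b = 120, 100.0, 10.0

def neg2lnL(s, b):
    mu = max(s + b, 1e-9)
    return 2.0 * (mu - n_obs * np.log(mu)) + ((b_meas - b) / sigma_b) ** 2

def profile(s):
    # minimize over the nuisance parameter b at fixed s (conditional MLE)
    res = minimize_scalar(lambda b: neg2lnL(s, b),
                          bounds=(1e-6, b_meas + 10.0 * sigma_b),
                          method="bounded")
    return res.fun

s_scan = np.linspace(0.0, 100.0, 401)
prof = np.array([profile(s) for s in s_scan])
s_hat, global_min = s_scan[prof.argmin()], prof.min()

def crossing(s):                        # -2 ln lambda(s) - 2.706
    return profile(s) - global_min - 2.706

lower = brentq(crossing, 0.0, s_hat) if crossing(0.0) > 0 else 0.0
upper = brentq(crossing, s_hat, s_scan[-1])
print(f"s_hat = {s_hat:.1f}, 90% CL interval = [{lower:.1f}, {upper:.1f}]")
```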

Jan Conrad NuFACT 06 25 August 2006 16

From MINUIT manual

See F. James, MINUIT Reference Manual, CERN Library Long Write-up D506, p. 5:

"The MINOS error for a given parameter is defined as the change in the value of the parameter that causes the F' to increase by the amount UP, where F' is the minimum w.r.t. all other free parameters."

[Figure: profile likelihood curve vs. parameter; the confidence interval is read off where Δχ² = 2.71 (90 %) or Δχ² = 1.07 (70 %).]

Jan Conrad NuFACT 06 25 August 2006 17

Coverage of profile likelihood

Background: Poisson (uncertainty ~ 20 %-40 %); efficiency: binomial (uncertainty ~ 12 %).

[Figure: Monte Carlo coverage (1−α)_MC vs. true s, comparing Rolke et al. and MINUIT.]

W. Rolke, A. Lopez, J. Conrad, Nucl. Instr. Meth. A551 (2005) 493-503

Jan Conrad NuFACT 06 25 August 2006 18

Confidence Intervals for new particle searches at LHC?

Basic idea: calculate a 5σ confidence interval and claim discovery if s = 0 is not included.

Straw-man model: n observed in the signal region, m observed in the background region (a sideband of size τ):

n ~ Poisson(s + b),  m ~ Poisson(τ b)

K. S. Cranmer, Proceedings PhyStat 2005

- Bayesian under-covers badly (add 16 events to get the correct significance)
- Profile is the only method considered here which gives coverage (except the full construction)

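For concreteness, a sketch of the significance calculation for this on/off model (my addition; the closed form below is the profile-likelihood-ratio significance, equivalent to the Li & Ma expression, and the counts are invented):

```python
# Profile-likelihood significance for the on/off problem
# n ~ Poisson(s + b), m ~ Poisson(tau * b), testing s = 0 against s > 0.
import numpy as np

def z_profile(n, m, tau):
    t = 1.0 + tau
    q = 2.0 * (n * np.log(t * n / (n + m)) +
               m * np.log(t * m / (tau * (n + m))))
    return np.sqrt(max(q, 0.0))   # q clipped at 0 for downward fluctuations

print(f"Z = {z_profile(n=150, m=1000, tau=10.0):.2f} sigma")  # ~4.4 sigma
```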

Jan Conrad NuFACT 06 25 August 2006 19

The profile likelihood and the χ2

The most common method in neutrino physics seems to be minimizing a χ².

Assume a Gaussian likelihood function:

L(θ) ∝ Π_i exp[ −(x_i − μ_i(θ))² / 2σ_i² ]

Omitting terms not dependent on the parameters:

−2 ln L(θ) = Σ_i (x_i − μ_i(θ))² / σ_i² = χ²

So a χ² fit is equivalent to the profile likelihood if you minimize w.r.t. the nuisance parameters: exactly for Gaussian processes, asymptotically otherwise.

Jan Conrad NuFACT 06 25 August 2006 20

A simple example calculation.

Model generating the data:

n ~ Poisson(s + b),  b_meas ~ Gauss(b, σ_b)

This means: in each experiment you measure n and b_meas, given s and b. σ_b is assumed to be known.

In what follows I use the χ² to calculate a p-value (not a confidence interval).

Jan Conrad NuFACT 06 25 August 2006 21

Two approaches using χ2

Adding the uncertainty in quadrature (with σ_n² the statistical variance of n):

χ²(s) = (n − s − b_meas)² / (σ_n² + σ_b²)

Allowing for a nuisance parameter (the background normalisation) and minimizing with respect to it:

χ²(s) = min_b [ (n − s − b)²/σ_n² + (b_meas − b)²/σ_b² ]

...seems to be quite common ...

Similar to what is used in, for example, Burguet-Castell et al., Nucl. Phys. B725:306-326, 2005 (beta-beams at the CERN SPS).
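One step worth making explicit (my algebra, not a claim from the slides): if σ_n is a fixed Gaussian width, profiling out b collapses the second recipe into the first, so any difference between the two must come from non-Gaussianity, e.g. a Poisson n with σ_n estimated from the data:

```latex
% Profiling a Gaussian nuisance term reproduces the quadrature form:
\min_{b}\left[\frac{(n-s-b)^2}{\sigma_n^2}
             +\frac{(b_\mathrm{meas}-b)^2}{\sigma_b^2}\right]
  = \frac{(n-s-b_\mathrm{meas})^2}{\sigma_n^2+\sigma_b^2}
```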

Jan Conrad NuFACT 06 25 August 2006 22

Coverage (type I error)

- Nominal χ²: what you assume is the correct null distribution.
- Ignore/Profile/Quadrature-add etc.: the ”real” null distributions of what you call a χ².
- Empirical: the ”true” χ² distribution ...... to the extent you trust ROOT .....
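A toy version of this check (my sketch, using the straw-man model from the previous slides): build the ”real” null distribution of the quadrature ”χ²” by Monte Carlo and compare its tail area beyond the nominal χ²(1 dof) critical value with α:

```python
# "Real" type I error of the quadrature chi^2 under s = 0, for an assumed
# model n ~ Poisson(b), b_meas ~ Gauss(b, sigma_b). With a small background
# the Poisson non-Gaussianity pulls the real rate away from the nominal alpha.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
alpha, n_exp = 0.10, 100000
crit = chi2.ppf(1.0 - alpha, df=1)        # nominal chi^2(1 dof) critical value

for b_true, sigma_b in [(100.0, 20.0), (5.0, 1.0)]:
    n = rng.poisson(b_true, size=n_exp)              # null (s = 0) ensemble
    b_meas = rng.normal(b_true, sigma_b, size=n_exp)
    q = (n - b_meas) ** 2 / (np.maximum(n, 1) + sigma_b ** 2)
    print(f"b = {b_true:5.1f}: real type I error = {np.mean(q > crit):.3f} "
          f"(nominal {alpha})")
```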

Jan Conrad NuFACT 06 25 August 2006 23

What if we have only Gaussian processes ?

Jan Conrad NuFACT 06 25 August 2006 24

Which method is more sensitive to signal ? Power

Jan Conrad NuFACT 06 25 August 2006 25

Power and sensitivity ?

In most cases I saw, an average result is presented. This tells you very little about the probability that a given signal will yield a significant observation (the power).

A shot at ”What should sensitivity mean ?”:

An experiment is sensitive to a finite value Θ of a parameter if the probability of obtaining an observation n which rejects Θ = 0 with at least significance α is at least β.

Jan Conrad NuFACT 06 25 August 2006 26

What is the ensemble ....

... of repeated experiments which

- I should use to calculate the ”mean” (or the probability β) in the sensitivity calculation ?
- I should use to calculate the coverage ?

Jan Conrad NuFACT 06 25 August 2006 27

My answer ...... .... both ensembles should be the same ....

Each pseudo-experiment:

- has fixed true values of the prime parameter and the nuisance parameters,
- yields a prime measurement (e.g. the number of observed events),
- yields one estimate for each nuisance parameter (e.g. the background) 1)

This estimate might come from auxiliary measurements in the same or other detectors, or from theory. In the former case, care has to be taken that the measurement procedure is replicated as in the real experiment.

In the case of theoretical uncertainties there is no real ”measurement process”. I would argue that even theoretical uncertainties should be treated as if there were a true value and an estimate, which we pretend is a random variable.

1) Shape and size of the uncertainties known beforehand ? Otherwise generalize .....

Jan Conrad NuFACT 06 25 August 2006 28

Update ”what should sensitivity mean ?”

An experiment is sensitive to a finite value Θ of a parameter if the probability of obtaining an observation n which rejects Θ = 0 with at least significance α is at least β.

The probability is hereby evaluated using replicas of the experiment with fixed true parameter Θ and fixed nuisance parameters. The random variables in this ensemble are thus the observation n and the estimates of the nuisance parameters.

The significance of the observation n is hereby evaluated using replicas of the experiment with fixed true parameter Θ = 0 and fixed nuisance parameters (assuming a p-value procedure; otherwise by CI).
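Putting the two paragraphs above together, a minimal end-to-end sketch (my illustration; the model, the test statistic, and all numbers are assumptions): fix the true parameters, build the null ensemble at Θ = 0, and estimate the power as the fraction of signal replicas whose p-value falls below α:

```python
# Sensitivity as power, computed over the ensemble defined above, for an
# assumed model n ~ Poisson(s + b), b_meas ~ Gauss(b, sigma_b).
import numpy as np

rng = np.random.default_rng(11)
b_true, sigma_b, s_true = 100.0, 15.0, 40.0   # fixed true parameters
alpha, n_rep = 0.05, 50000

def replicas(s):
    n = rng.poisson(s + b_true, size=n_rep)           # prime measurement
    b_meas = rng.normal(b_true, sigma_b, size=n_rep)  # nuisance estimate
    # simple test statistic: background-subtracted excess over its error
    return (n - b_meas) / np.sqrt(np.maximum(n, 1) + sigma_b ** 2)

t_null = np.sort(replicas(0.0))   # null ensemble, Theta = 0
t_sig = replicas(s_true)          # signal ensemble, Theta = s_true

# p-value of each signal replica = fraction of the null ensemble at or above it
p = 1.0 - np.searchsorted(t_null, t_sig) / n_rep
power = np.mean(p < alpha)
print(f"power at alpha = {alpha}: {power:.2f}")
```

The experiment would then be called sensitive to s_true if this power comes out at or above the chosen β.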

Jan Conrad NuFACT 06 25 August 2006 29

Unscientific study of 12 papers dealing with sensitivities to oscillation parameters

- 0 papers seem to worry about the ensemble w.r.t. which the ”mean” is calculated
- 0 papers check the statistical validity of the χ² used
- 3 papers treat systematics and write down explicitly what χ² is used (or give enough information to reconstruct it in principle)
- 6 papers ignore the systematics or don't say how they are included in the fit
- 2 of the papers don't say how significances/CIs are calculated
- 1 paper doesn't even tell me the significance level

No paper is doing what I would like best; ¼ of the papers are in my opinion acceptable with some goodwill; ¾ of the papers I would reject. Binomial errors on these figures are neglected.

How do things look in the nuFact community ?

Jan Conrad NuFACT 06 25 August 2006 30

Summary/Recommendations

More is more !

Include systematics in your calculation (or discuss why you neglect them); don't just silently drop them.

Report under which assumptions the data are generated. Report the test statistic you are using explicitly.

What does ”mean” mean ?

I did not encounter any discussion of either the power of your sensitivity analysis or of the ensemble of experiments which is used for the ”average”.

Jan Conrad NuFACT 06 25 August 2006 31

Summary cont'd

And the winner is ....

Most of the papers have been using a χ² fit. If you include nuisance parameters in it and minimize w.r.t. them, this is equivalent to a profile likelihood approach: exactly for strictly Gaussian processes, asymptotically otherwise. This approach seems to provide coverage in many, even unexpected, cases.

Don’t think, compute .....

Given the computer power available (and since the stakes are high), I think that for sensitivity studies comparing different experimental configurations there is no reason to stick slavishly to the nominal χ² distribution instead of doing a toy MC to construct the distribution of the test statistic yourself.

The thinking part is to choose the ensemble of experiments to simulate.

Jan Conrad NuFACT 06 25 August 2006 32

And a last one ....for the customer

Best is not necessarily best.

The intuitive (and actual) effect of including systematics (or of doing a careful statistical analysis instead of a crude one) is to worsen the calculated sensitivity. If I were to spend XY M$ on an experiment, I would insist on understanding in detail how the sensitivity is calculated.

Otherwise, if anything, I would give the XY M$ to the group with the worse sensitivity but the more adequate calculation.

Jan Conrad NuFACT 06 25 August 2006 33

List of relevant references.

G. Feldman & R. Cousins, Phys. Rev. D57:3873-3889, 1998 - THE method for confidence interval calculation.

J. Conrad et al., Phys. Rev. D67:012002, 2003 - combining FC with a Bayesian treatment of systematics.

J. Conrad & F. Tegenfeldt, Proceedings PhyStat 05, Oxford, 2005, physics/0511055 - combined experiments, power calculations for CIs with Bayesian treatment of systematics.

F. Tegenfeldt & J. Conrad, Nucl. Instr. Meth. A539:407-413, 2005 - coverage of confidence intervals.

L. Demortier, draft report presented at the BIRS workshop on statistical inference problems in high energy physics and astronomy, July 15-20, 2006 - all you want to know about p-values but don't dare to ask.

W. Rolke, A. Lopez & J. Conrad, Nucl. Instr. Meth. A551 (2005) 493-503 - profile likelihood and its coverage.

K. S. Cranmer, Proceedings PhyStat 05 - significance calculation for the LHC.

F. James, Computer Phys. Comm. 20 (1980) 29-35 - profile likelihood without calling it that.

G. Punzi, Proceedings PhyStat 2003, SLAC, Stanford (2003) - a definition of sensitivity including power.

S. Baker & R. Cousins, Nucl. Instr. Meth. 221 (1984) 437 - likelihood and χ² in fits to histograms.

J. Burguet-Castell et al., Nucl. Phys. B725:306-326, 2005 - example of a rather reasonable sensitivity calculation in neutrino physics (random pick; there are certainly others, maybe even better).

R. Barlow, ... J. Conrad et al., ”The Banff comparison of methods to calculate Confidence Intervals” - systematic comparison of confidence interval methods, to be published beginning 2007.

Jan Conrad NuFACT 06 25 August 2006 34

Backups

Jan Conrad NuFACT 06 25 August 2006 35

What if we have 20 % unc ?

Jan Conrad NuFACT 06 25 August 2006 36

Added uncertainty in efficiency.

Jan Conrad NuFACT 06 25 August 2006 37

Requirements for χ2

- Gaussian distribution: N(s, s)
- Hypothesis linear in the parameters (so, for example, ”χ²” = (n − s²)²/s doesn't work)
- The functional form of the hypothesis is correct.