Frequentist inference only seems easy By John Mount


Transcript of Frequentist inference only seems easy By John Mount

Frequentist estimation only seems easy

John Mount Win-Vector LLC

1

Outline

First example problem: estimating the success rate of coin flips.

Second example problem: estimating the success rate of a dice game.

Interspersed in both: an entomologist’s view of lots of heavy calculation.

Image from "HOW TO PIN AND LABEL ADULT INSECTS" Bambara, Blinn, http://www.ces.ncsu.edu/depts/ent/notes/4H/insect_pinning4a.html

2 This talk is going to alternate between simple probability games (like rolling dice) and the detailed calculations needed to bring the reasoning forward. If you come away with two points from this talk, remember: classic frequentist statistics is not as cut and dried as teachers claim (so it is okay to ask questions), and Bayesian statistics is not nearly as complicated as people make it appear.

The point of this talk

Statistics is a polished field where many of the foundations are no longer discussed.

A lot of the “math anxiety” felt in learning statistics is from uncertainty about these foundations, and how they actually lead to common practices.

We are going to discuss common simple statistical goals (correct models, unbiasedness, low error) and how they lead to common simple statistical procedures.

The surprises (at least for me) are:

There is more than one way to do things.

The calculations needed to justify how even simple procedures are derived from the goals are in fact pretty involved.

3 A lot of the pain of learning is being told there is only "one way" (when there is more than one) and that a hard step (linking goals to procedures) is easy (when in fact it is hard). Statistics would be easier to teach if those two things were true, but they are not. However, not addressing these issues makes learning statistics harder than it has to be. We are going to spend some time on what are appropriate statistical goals, and how they lead to common statistical procedures (instead of claiming everything is obvious). You won't be expected to invent the math, but you need to accept that it is in fact hard to justify common statistical procedures without somebody having already done the math. And I'll be honest: I am a math-for-math's-sake guy.

What you will get from this presentation

Simple puzzles that present problems for the common rules of estimating rates.

Good for countering somebody who says “everything is easy and you just don’t get it.”

Examples that expose strong consequences of the seemingly subtle differences in common statistical estimation methods.

Makes understanding seemingly esoteric distinctions like Bayesianism and frequentism much easier.

A taste of some of the really neat math used to establish common statistics.

A revival of Wald game-theoretic style inference (as described in Savage “The Foundations of Statistics”).

4 You will get to roll the die, and we won't make you do the heavy math. Aside: we have been telling people that one of the things that makes data science easy is that large data sets allow you to avoid some of the hard math in small sample size problems. Here we work through some of the math. In practice you do get small sample size issues even in large data sets, due to heavy-tail-like phenomena and when you introduce conditioning and segmentation (themselves typical modeling steps).

First example: coin flip game

Why do we even care?

The coin problem is a stand-in for something that is probably important to us: such as estimating the probability of a sale given features and past experience: P[ sale | features,evidence ].

Being able to efficiently form good estimates that combine domain knowledge, current features and past data is the ultimate goal of analytics/data-science.

6

The coin problem

You are watching flips of a coin and want to estimate the probability p that the coin comes up heads.

For example: "T" "T" "H" "T" "H" "T" "H" "H" "T" "T"

Easy to apply!

Sufficient statistic: 4 heads, 6 tails

Frequentist estimate of p: p ~ heads/(heads+tails) = 0.4

Done. Thanks for your time.

7 # R code
set.seed(2014)
sample = rbinom(10, 1, 0.5)
print(ifelse(sample > 0.5, 'H', 'T'))

Wait, how did we know to do that?

Why is it obvious h/(h+t) is the best estimate of the unknown true value of p?

8 Fundamental problem: a mid-range probability prediction (say a number in the range 1/6 to 5/6) is not falsifiable by a single experiment. So: how do we know such statements actually have empirical content? The usual answers are performance on long sequences (frequentist), appeals to axioms of probability (essentially additivity of disjoint events), and subjective interpretations. Each view has some assumptions and takes some work.

The standard easy estimate comes from frequentism

The standard answer (this example from http://en.wikipedia.org/wiki/Checking_whether_a_coin_is_fair ):


Plot of the probability density f(x | H = 7, T = 3) = 1320 x^7 (1 - x)^3 with x ranging from 0 to 1.

is small when compared with the alternative hypothesis (a biased coin). However, it is not small enough to cause us to believe that the coin has a significant bias. Notice that this probability is slightly higher than our presupposition of the probability that the coin was fair corresponding to the uniform prior distribution, which was 10%. Using a prior distribution that reflects our prior knowledge of what a coin is and how it acts, the posterior distribution would not favor the hypothesis of bias. However the number of trials in this example (10 tosses) is very small, and with more trials the choice of prior distribution would be somewhat less relevant.)

Note that, with the uniform prior, the posterior probability distribution f(r | H = 7, T = 3) achieves its peak at r = h / (h + t) = 0.7; this value is called the maximum a posteriori (MAP) estimate of r. Also with the uniform prior, the expected value of r under the posterior distribution is (h + 1)/(h + t + 2) = 2/3.

Estimator of true probability

The best estimator for the actual value is the estimator p = h/(h + t).

This estimator has a margin of error (E) at a particular confidence level.

Using this approach, to decide the number of times the coin should be tossed, two parameters are required:

1. The confidence level, denoted by confidence interval (Z)
2. The maximum (acceptable) error (E)

The confidence level is denoted by Z and is given by the Z-value of a standard normal distribution.

Answer is correct and simple, but not good (as it lacks context, assumptions, goals, motivation and explanation).

Stumper: without an appeal to authority, how do we know to use the estimate heads/(heads+tails)? What problem is such an estimate solving (what criterion is it optimizing)?

9 Notation is a bit different: here tau is the unknown true value and p is the estimate. Throughout this talk by "coin" we mean an abstract device that always returns one of two states. Gelman and Nolan have an interesting article "You Can Load a Die, But You Can't Bias a Coin" http://www.stat.columbia.edu/~gelman/research/published/diceRev2.pdf about how hard it would be to bias an actual coin that you allow somebody else to flip (and how useless articles testing the fairness of the new Euro were).

Also, there are other common estimates

Examples:

A priori belief: p ~ 0.5 regardless of evidence.

Bayesian (Jeffreys prior) estimate: p ~ (heads+0.5)/(heads+tails+1) = 0.4090909

Laplace smoothed estimate: p ~ (heads+1)/(heads+tails+2) = 0.4166667

Game theory minimax estimates (more on this later in this talk).

The classic frequentist estimate is not the only acceptable estimate.

10 Each of these has its merits. A prior belief has the least sampling noise (as it ignores the data). Bayesian with Jeffreys prior very roughly tries to maximize the amount of information captured in the first observation. Laplace smoothing minimizes expected square error under a uniform prior.
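As a quick sanity check, here is a minimal R sketch (added for illustration; not in the original slides) reproducing the estimates above for the observed 4 heads and 6 tails:

# R code (illustrative sketch)
heads <- 4; tails <- 6
heads/(heads + tails)                # frequentist estimate: 0.4
(heads + 0.5)/(heads + tails + 1)    # Bayesian (Jeffreys prior) estimate: 0.4090909
(heads + 1)/(heads + tails + 2)      # Laplace smoothed estimate: 0.4166667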

Each different estimate has its own characteristic justification

From “The Cartoon Guide to Statistics” Gonick and Smith.

11 If all of the estimates were "fully compatible" with each other then they would all be identical. Which they clearly are not. Notice we are discussing differences in estimates here, not differences in significances or hypothesis tests. Also Bayesian priors are not always subjective beliefs (Wald in particular used an operational definition).

The standard story

There are one or two standard ways to do statistics: frequentism and (maybe) Bayesianism.

In frequentist estimation the unknown quantity to be estimated is fixed at a single value and the experiment is considered a repeatable event (with different possible measurements on each repetition).

All probabilities are over possible repetitions of experiment with observations changing.

In Bayesian estimation the unknown quantity to be estimated is assumed to have non-trivial distribution and the experimental results are considered fixed.

All probabilities are over possible values of the quantity to be estimated. Priors talk about the assumed distribution before measurement, posteriors talk about the distribution conditioned on the measurements.

12 There are other differences, such as a preference for point estimates versus full descriptions of the distribution. And these are not the only possible models.

Our coin example again

I flip a coin a single time and it comes up heads- what is my best estimate of the probability the coin comes up heads in repeated flips?

“Classic”/naive probability: 0.5 (independent of observations/data)

Frequentist: 1.0

Bayesian (Jeffreys prior): 0.75

13 Laws that are correct are correct in the extreme cases. (If we have distributed 6-sided dice.) Let's try this. Everybody roll your die. If it comes up odd you win and if it comes up even you lose. Okay, somebody who won, raise your hand. Each one of you, if purely frequentist, estimates a 100% chance of winning this game (if you stick only to data from your die). Now please put your hands down. Everybody who did not win: how do you feel about the estimate of a 100% chance of winning?

What is the frequentist estimate optimizing?

"Bayesian Data Analysis" 3rd Edition,Gelman, Carlin, Stern, Dunson, Vehtari, Rubin p. 92 states that frequentist estimates are designed to be consistent (as the sample size increases they converge to the unknown value), efficient (they tend to minimize loss or expected square-error), or even have asymptotic unbiasedness (the difference in the estimate from the true value converges to zero as the experiment size increases, even when re-scaled by the shrinking standard error of the estimate).

If we think about it: frequentism is interpreting probabilities as limits of rates of repeated experiments. In this form bias is an especially bad form of error as it doesn’t average out.

14 Why not minimize L1 error? Because this doesn't always turn out to be unbiased (or isn't always a regression). Bayesians can allow bias. The saving idea is: don't average estimators, but aggregate data and form a new estimate.

Frequentist concerns: bias and efficiency (variance)

From:“The Cambridge Dictionary of Statistics” 2nd Edition, B.S. Everitt.

Bias: An estimator θ̂ for which E[θ̂] = θ is said to be unbiased.

Efficiency: A term applied in the context of comparing different methods of estimating the same parameter; the estimate with the lowest variance being regarded as the most efficient.

15 There is more than one unbiased estimate. For example a grand average (unconditioned by features) is an unbiased estimate.

A good motivation of the frequentist estimate

Adapted from “Schaum’s Outlines Statistics” 4th Edition, Spiegel, Stephens,pp. 204-205.

SAMPLING DISTRIBUTIONS OF MEANS

Suppose that all possible samples of size N are drawn without replacement from a finite population of size Np > N. If we denote the mean and standard deviation of the sampling distribution of means by E[µ̂] and E[σ̂] and the population mean and standard deviation by µ and σ respectively, then

E[µ̂] = µ and E[σ̂] = (σ/√N)·√((Np − N)/(Np − 1))   (1)

If the population is infinite or if sampling is with replacement, the above results reduce to

E[µ̂] = µ and E[σ̂] = σ/√N   (2)

SAMPLING DISTRIBUTIONS OF PROPORTIONS

Suppose that a population is infinite and the probability of occurrence of an event (called its success) is p. ... We thus obtain a sampling distribution of proportions whose mean E[p̂] and standard deviation E[σ̂] are given by

E[p̂] = p and E[σ̂] = √(p(1 − p)/N)   (3)

16 A very good explanation. Unbiased views of the unknown parameter and its variance are directly observable in the sampling distribution, so you copy the observed values as your estimates. But to our point: frequentism no longer seems so simple. This is also close to the Bayesian justification: build a complete generative model with complete priors, and then you can copy averages of what you observe.
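A small simulation can make the quoted sampling-distribution facts concrete. This is a rough R sketch (added for illustration; the choices p = 0.3 and N = 100 are arbitrary):

# R code (illustrative sketch)
set.seed(2014)
p <- 0.3; N <- 100
p.hat <- replicate(10000, mean(rbinom(N, 1, p)))   # many sample proportions
mean(p.hat)          # close to p: the sample proportion is unbiased
sd(p.hat)            # close to sqrt(p*(1 - p)/N)
sqrt(p*(1 - p)/N)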

Why is the frequentist forced to use the estimates 0 and 1?

If the frequentist estimate is to be unbiased for any unknown value of p (in the range 0 through 1) then we must have, for each such p:

\sum_{h=0}^{n} \text{P}[h|n,p] e_{n,h} = \sum_{h=0}^{n} {n \choose h} p^h (1-p)^{n-h} e_{n,h} = p

The frequentist estimate for each possible outcome of seeing h-heads in n-flips is a simultaneous planned panel of estimates e(n,h) that must satisfy the above bias-check equations for all p.

These check conditions tend to be independent linear equations over our planned estimates e(n,h). So the system has at most one solution, and it turns out the solution e(n,h) = h/n works.

Insisting on unbiasedness completely determines the solution.

17 Estimates like 0 and 1 are wasteful in the sense they allow only one-sided errors. Laplace "add one smoothing" puts estimates between likely values (lowering expected l2 error under uniform priors). The check equations tend to be full rank linear equations in e(n,h) as the p-s generate something very much like the moment curve (which itself is a parameterized curve generating sets of points in general position). The reason I am showing this is: usually frequentist inference is described as canned procedures (avoiding triggering math anxiety) and Bayesian methods are presented as complicated formulas. In fact you should be as uncomfortable with frequentist methods as you are with Bayesian methods.

\sum_{h=0}^{n} \text{P}[h|n,p] e_{n,h} = \sum_{h=0}^{n} { n \choose h} p^h (1-p)^{n-h} e_{n,h} = p
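The "planned panel of estimates" idea can also be seen directly in code. A minimal R sketch (added for illustration, with n = 5 chosen arbitrarily): write the bias-check equation at n+1 distinct values of p and solve the resulting linear system; it recovers e(n,h) = h/n.

# R code (illustrative sketch)
n <- 5
p.check <- seq(0.1, 0.9, length.out = n + 1)             # any n+1 distinct p values
A <- t(sapply(p.check, function(p) dbinom(0:n, n, p)))   # rows of P[h | n, p]
e <- solve(A, p.check)                                   # solve the bias-check equations
round(e, 6)                                              # 0.0 0.2 0.4 0.6 0.8 1.0 = h/n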

Argh! That is a lot of painful math.

The math (turning reasonable desiderata to reasonable procedures) has always been hiding there.

You never need to re-do the math to use the classic frequentist inference procedures (just to derive them).

18 We really worked out h/(h+t) the hard way. The frequentist can't generate an estimate for a single outcome; they must submit a panel of estimates for every possible outcome and then check that the panel represents a schedule of estimates that is simultaneously unbiased for any possible p.

Is the frequentist solution optimal?

It is the only unbiased solution. So it is certainly the most efficient unbiased solution.

What if we relaxed unbiasedness? Are there more efficient solutions?

Yes: consider estimates e(1,h) = (0,1) and b(1,h) = (1/4,3/4)

Suppose loss is: loss(f,n) = E[ E[(f(n,h)-p)^2 | h ~ p,n] | p ~ P[p] ]

P[p] is an assumed prior probability on p, such as P[p] = 1/3 if p=0,1/2,1 and 0 otherwise.

Then: loss(1,b) = 0.0625 and loss(1,e) = 0.25. So loss(1,b) < loss(1,e), you can think of the Bayesian procedures as being more efficient.

But that isn’t fair. Insisting on a prior is adopting the Bayesian’s assumptions as truth. Of course that makes them look better.

19
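A minimal R check of the two panels above (added for illustration): the squared-error loss of b(1,h) = (1/4, 3/4) is flat at 0.0625 for every p, while the loss of e(1,h) = (0, 1) is p(1-p), peaking at 0.25 when p = 1/2.

# R code (illustrative sketch)
L1 <- function(f, p) (1 - p)*(f[1] - p)^2 + p*(f[2] - p)^2
p.grid <- c(0, 0.25, 0.5, 0.75, 1)
sapply(p.grid, function(p) L1(c(0, 1), p))       # 0.0000 0.1875 0.2500 0.1875 0.0000
sapply(p.grid, function(p) L1(c(1/4, 3/4), p))   # 0.0625 0.0625 0.0625 0.0625 0.0625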

Frequentist response: you can’t just wish-away bias

Let’s try this lower-loss Bayesian estimate b(1,h) = (0.25,0.75)

Suppose we have 50 dice and we record wins as 1, losses as 0.

Suppose in the above experiment there were 50 of us and 8 people won.

Averaging the frequentist estimates: (8*1.0 + 42*0.0)/50 = 0.16 (not too far from the true value 1/6 = 0.1666667).

Averaging the “improved” Bayesian estimates: (8*0.75 + 42*0.25)/50 = 0.33. Way off and most of the error is bias (not mere sampling error).

Bayesian response: you don’t average estimates, you aggregate data and re-estimate. So you treat the group as a single experiment with 8 wins and 42 losses. Estimate is then (8+0.5)/(50+1) = 0.1666667 (no reason for estimate to be dead on, Bayesians got lucky that time).

20 (If they have dice they can run with this: all roll, count, and compute.) Bayesian response: you don't average individual estimators, you aggregate the data and re-estimate.

# R code
> set.seed(2014)
> sample = rbinom(50, 1, 1/6)
> sum(sample)/length(sample)
[1] 0.16
> sum(ifelse(sample > 0.5, 0.75, 0.25))/length(sample)
[1] 0.33
> (0.5 + sum(sample))/(1 + length(sample))
[1] 0.1666667

Second example: dice game

Dice are a fun example

22 Dice are pretty much designed to obey the axioms of naive/classical probability theory (indivisible events having equal probability). Also once you have a lot of dice it is easy to think in terms of exchangeable repetitions of experiments (frequentist). Given that, you will forgive us if we tilt the game towards the Bayesians by adding some hidden state.

The dice game

A control die numbered 1 through 5 is either rolled or placed on one of its sides.

The game die is a fair die numbered 1 through 6. When the game die is rolled the game is a win if the number shown on the game die is greater than the number shown on the control die.

The control die is held at the same value even when we re-roll the game die.

Neither of the dice is ever seen by the player.

23

You only see the win/lose state not the control die or the game die

24 (if we have distributed 6-sided dice) Let's play a round of this. I'll hold the control die at 3. You all roll your 6-sided die. Okay, everybody whose die exceeded 3, raise your hand. This time we will group our observations to estimate the "unknown" probability p of winning. What we are looking for is that close to half the room (assuming we have enough people to build a large sample, and that we don't get incredibly unlucky) have raised their hands. From this you should be able to surmise there are good odds the control die is set at 3, even if you don't remember what you saw on the control die or what was on your game die.

Multiple plays

The control die is held at a single value and you try to learn the odds by observing the wins/losses reported by repeated rolls of the game die (but not ever seeing either of the dice).

25
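To make the setup concrete, here is a small R simulation of repeated plays (added for illustration; holding the control die at 3 and playing 1000 rounds are arbitrary choices):

# R code (illustrative sketch)
set.seed(2014)
control <- 3                                 # hidden setting of the control die
game <- sample(1:6, 1000, replace = TRUE)    # repeated rolls of the fair game die
wins <- as.numeric(game > control)           # the player only sees win/lose
mean(wins)                                   # near (6 - control)/6 = 0.5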

The empirical frequentist procedure seems off

After the first flip you are forced (by the bias check conditions) to estimate a win-rate of 0 or 1. The win rate is always one of 1/6, 2/6, 3/6, 4/6, or 5/6. So your first estimate is always out of range.

After 5 flips the bias equations no longer determine a unique solution. So you can try to decrease variance without adding any bias. But since your solution is no longer unique, you should have less faith it is the one true solution.

26 Could try Winsorising and using 1/6 as our estimate if we lose and 5/6 as our estimate if we win. But we saw earlier that “tucking in” estimates doesn’t always help (it introduces a bad bias).

How about other estimates?

Can we find an estimator that uses criteria other than unbiasedness without the strong assumption of knowing a favorable prior distribution?

Remember: if we assume a prior distribution (even a so-called uninformative prior) and the assumption turns out to be very far off, then our estimate may be very far off (at least until we have enough data to dominate the priors).

How about a solution that does well for the worst possible selection of the unknown probability p?

We are not assuming a distribution on p, just that it is picked to be worst possible for our strategy.

27

Leads us to a game theory minimax solution

We want an estimate f(n,h) such that:

f(n,h) = \text{argmin}_{f(n,h)} \max_{p \in \mathbb{R}^{5}, p \ge 0, 1 \cdot p = 1} \sum_{c=1}^{5} p_c \sum_{k=0}^{n} \text{P}[k \text{ wins} | n,p_c] \times \text{loss}( f(n,k), \frac{6-c}{6} )

Where loss(u,v) = (u-v)^2 or loss(u,v) = |u-v|. Here the opponent is submitting a vector p of probabilities of setting the control die to each of its 5 marks. The standard game-theory way to solve this is to find an f(n,h) that works well against the opponent picking a single state of the control die (c) after they see our complete set of estimates. That is:

f(n,h) = \text{argmin}_{f(n,h)} \max_{c \in \{1,\cdots,5\}} \sum_{k=0}^{n} \text{P}[k \text{ wins} | n,p_c] \times \text{loss}( f(n,k), \frac{6-c}{6} )

28 In practice we would just use Bayesian methods with reasonable priors. The reduction of one very hard form to another slightly less-hard problem is the core theorem of game theory. Even if you have been taught not to fear long equations, these should look nasty (as they have a lot of quantifiers in them, and quantifiers can rapidly increase complexity). f(n,h) is just a panel or vector of n+1 estimate choices for each n. Also once you have things down to a simple minimization you essentially have a problem of designing numerical integration or optimal quadrature.

f(n,h) = \text{argmin}_{f(n,h)} \max_{p \in \mathbb{R}^{5}, p \ge 0, 1 \cdot p = 1}\sum_{c=1}^{5} p_c \sum_{k=0}^{n} \text{P}[k \text{ wins} | n,p_c] \times \text{loss}( f(n,k) ,\frac{6-c}{6} )

f(n,h) = \text{argmin}_{f(n,h)} \max_{c \in \{1,\cdots , 5\}} \sum_{k=0}^{n} \text{P}[k \text{ wins} | n,p_c] \times \text{loss}( f(n,k) ,\frac{6-c}{6} )

Wald already systematized this

29 If you believe the control die is set by a fair roll, then we again have a game designed to exactly match a specific generative model (i.e. designed for Bayesian methods to win). If you believe the die is set by an adversary, you again have a game theory problem. Player 1 is trying to maximize risk/loss/error and player 2 is trying to minimize risk. We model the game as both players submitting their strategies at the same time. The standard game theory solution is you pick a strategy so strong that you would do no worse if your opponent peeked at it and then altered their strategy. This is part of a minimax setup.

Wald, A. (1949). Statistical Decision Functions. Ann. Math. Statist., 20(2):165–205.

Wald was very smart

One of his WWII ideas: armor sections of combat planes that you never saw damaged on returning planes. Classical thinking: put armor where you see bullet holes. Wald: put armor where you have never seen a bullet hole (hence never seen a hit survived).

30 Wald could bring a lot of deep math to the table. Wald's solution allows for many different choices of loss (not just variance or L2) and for probabilistic estimates (i.e. you don't have to return the same estimate every time you see the same evidence, though that isn't really an advantage).

Our game

In both cases the loss function is convex, so we expect a unique connected set of globally optimal solutions (no isolated local minima).

For the l1-loss case where loss(u,v) = |u-v| we can solve for the optimal f(n,k) by a linear program.

1-round l1 solution [0.3, 0.7]

2-round l1 solution [0.24, 0.5, 0.76]

For the l2-loss case where loss(u,v) = (u-v)^2 we can solve for the optimal f(n,k) using Newton’s method.

1-round l2 solution [0.25, 0.75]

2-round l2 solution [0.21, 0.5, 0.79]

31 These solutions are profitably exploiting both the boundedness of p (in the range 1/6 through 5/6) and the fact that p only takes one of 5 possible values (though we obviously don't know which).

How do we pick between l1 and l2 loss? l2 is traditional as it is the next natural moment after the first moment (which becomes the bias conditions). Without the bias conditions l1 loss is plausible (and leads to things like quantile regression). l2 has some advantages (such as the gradient structure tending to get expectations right, hence helping enforce regression conditions and reduce bias).

Another game

Suppose the opponent can pick any probability for a coin (they are not limited to 1/6,2/6,3/6,4/6,5/6).

In this case we want to pick f(n,h) minimizing:

M(n,f(n,h)) = \max_{p \in [0,1]} \sum_{k=0}^{n} \text{P}[k \text{ wins} | n,p] \times \text{loss}( f(n,k), p )

32 M(n,f(n,h)) = \max_{p \in [0,1]} \sum_{k=0}^{n} \text{P}[k \text{ wins} | n,p] \times \text{loss}( f(n,k) ,p )
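One way to get a feel for this problem is brute force. A rough R sketch (added for illustration) grid-searches the pair of planned 1-round estimates and minimizes the worst squared-error loss over a grid of p values; it should land near the (0.25, 0.75) solution discussed on the following slides.

# R code (illustrative sketch: brute-force 1-round l2 minimax over p in [0,1])
p.grid <- seq(0, 1, by = 0.01)
worst.loss <- function(e0, e1) {
  max((1 - p.grid)*(e0 - p.grid)^2 + p.grid*(e1 - p.grid)^2)
}
est.grid <- seq(0, 1, by = 0.005)
cand <- expand.grid(e0 = est.grid, e1 = est.grid)
cand$worst <- mapply(worst.loss, cand$e0, cand$e1)
cand[which.min(cand$worst), ]                # near e0 = 0.25, e1 = 0.75, worst ~ 1/16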

The general p l2 minimax solutions

For the l1-loss case where loss(u,v) = |u-v| we have a convex program with a different linear constraint for each possible p. A column generation strategy over an LP solver handles this quite nicely.

For the l2-loss case where loss(u,v) = (u-v)^2 the solution is:

(heads + √(heads + tails)/2) / (heads + tails + √(heads + tails))

33 Savage, L. J. (1972). The Foundations of Statistics. Dover cites this solution as coming from Hodges, J. L., Jr. and Lehmann, E. L. (1950). Some problems in minimax point estimation. The Annals of Mathematical Statistics, 21(2):182–197.

See http://winvector.github.io/freq/minimax.pdf for details.

\frac{\text{heads} + \sqrt{\text{heads} + \text{tails}}/2}{\text{heads} + \text{tails} + \sqrt{\text{heads}+\text{tails}}}
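For reference, a tiny R sketch of the closed-form estimate quoted above (added for illustration); it also reproduces the l2 minimax table shown near the end of the talk.

# R code (illustrative sketch)
l2.minimax <- function(heads, tails) {
  n <- heads + tails
  (heads + sqrt(n)/2)/(n + sqrt(n))
}
l2.minimax(0, 1)   # 0.25
l2.minimax(1, 1)   # 0.5
l2.minimax(0, 5)   # 0.1545085
l2.minimax(2, 3)   # 0.4309017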

How can you solve the l2 minimax problem?

For every n there is an f(n,h) (essentially a table of n+1 estimates) such that L(n,f(n,h),p) = g(n) where g(n) is free of p. And further: the partial derivatives of L(n,·,·) with respect to the entries of f(n,h), evaluated at this f(n,h), are not p-free. In fact there are always p-s that allow us to freely choose the sign of this gradient.

Enough to claim:

Define: L(n, f(n,h), p) = \sum_{k=0}^{n} \text{P}[k \text{ wins}|n, p] \times (f(n,k) - p)^2

Examples:

L(1,(1/4,3/4),p) = 1/16

L(2,(-1/2 + sqrt(2)/2,1/2,-sqrt(2)/2 + 3/2),p) = -sqrt(2)/2 + 3/4

\text{argmin}_{f(n,h)} \max_p L(n, f(n,h), p) = \text{root}_{f(n,h)} L(n, f(n,h), p) - f(n,0)^2

34 L(n,f(n,h),p) = \sum_{k=0}^{n} \text{P}[k \text{ wins} | n,p] \times ( f(n,k)-p )^2

\text{argmin}_{f(n,h)} \max_p L(n,f(n,h),p) = \text{root}_{f(n,h)} L(n,f(n,h),p) - f(n,0)^2

We know L(n,f(n,h),p) is convex in f(n,h), so max_p L(n,f(n,h),p) is also convex in f(n,h). We are not looking at the usual Karush–Kuhn–Tucker conditions of optimality. What I think is going on is M(n,f(n,h)) = max_p L(n,f(n,h),p) is majorized by L(,,), so we are collecting evidence of the optimal point through p. What is exciting is we get rid of quantifiers, making the problem much easier.

See http://winvector.github.io/freq/explicitSolution.html and https://github.com/WinVector/Examples/blob/master/freq/python/explicitSolution.rst for more details.
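A quick numeric check (added for illustration) that the claimed estimates really do make L(n, f(n,h), p) constant in p:

# R code (illustrative sketch)
L <- function(f, p) {
  n <- length(f) - 1
  sum(dbinom(0:n, n, p) * (f - p)^2)
}
sapply(c(0, 0.3, 0.5, 0.9), function(p) L(c(1/4, 3/4), p))     # all equal 1/16
f2 <- c(-1/2 + sqrt(2)/2, 1/2, 3/2 - sqrt(2)/2)
sapply(c(0, 0.3, 0.5, 0.9), function(p) L(f2, p))              # all equal 3/4 - sqrt(2)/2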

The l2 minimax solution in a graph

Solution of the form L(1,(lambda,1-lambda),p).

Notice best minimax solution is at f(1,h) = (0.25,0.75).

Notice all p-curves cross there.

Also notice if you move from 0.25, you can always find a p that makes things worse.

This proves the solution is a local minimum, so by convexity it is also the global optimum.

35 So it is just a matter of checking the stated solution clears the p’s out of L(k,,p). Leonard J. Savage gives this example on page 203 of the 1972 edition of “The Foundations of Statistics.” He attributes it to: “Some Problems in Minimax Point Estimation” J L Hodges and E L Lehmann, The Annals of Mathematical Statistics, 1950 vol. 21 (2) pp. 182-197.

A few exact l1/l2 solutions

1-round l2 solution: (1/4, 3/4) (also the 1-round l1 solution)

2-round l2 solution: (-1/2 + sqrt(2)/2, 1/2, -sqrt(2)/2 + 3/2)

~ (0.207, 0.5, 0.793)

Not the same as the 2-round l1 solution: (0.192, 0.5, 0.808)

36 Again this game is to build a best l1 or l2 estimate for any p in the range 0 through 1. Each estimate is biased (as they don't agree with the traditional empirical frequentist estimate), but the bias is going down as n goes up. Also these estimates are not the traditional Bayesian ones, as they don't agree with anything coming from traditional priors (notice the non-rational values). These are related to what Wald called "logical Bayes" where the Bayesian method is used, but we don't insist on priors (instead we solve a minimax problem, where we try to do well under worst-possible initial distributions).

Table of estimates

[Graph: the chosen estimate phi (y-axis, 0 to 1) plotted against n (x-axis, 1 to 10), one point per possible outcome h/n, for four estimators: Bayes (Jeffreys), Frequentist, l1 minimax, and l2 minimax.]

37 For each of the four major estimates we discussed we show the chosen estimate phi for h heads out of n flips. In general the frequentist estimate is outside Bayes, which is outside l1 minimax, which is outside l2 minimax. The l1 and l2 interior solutions are very close. This is a graph of a ready-to-go decision table (a user could forget everything up until here and just pick their phis off the graph). Notice the frequentist solution crosses l2 minimax around n=8. Also all solutions except l1 minimax are equally spaced when n is held fixed. For more details see: https://github.com/WinVector/Examples/blob/master/freq/python/freqMin.rst

Or: consider this table no easier to use …

"Frequentist" h n 0 1 2 3 4 5 1 0.0000000 1.0000000 2 0.0000000 0.5000000 1.0000000 3 0.0000000 0.3333333 0.6666667 1.0000000 4 0.0000000 0.2500000 0.5000000 0.7500000 1.0000000 5 0.0000000 0.2000000 0.4000000 0.6000000 0.8000000 1.0000000

38 Obviously you don’t need the table for frequentist as h/(h+t) is easy to remember.

than to use:

"l2 minimax" h n 0 1 2 3 4 5 1 0.2500000 0.7500000 2 0.2071068 0.5000000 0.7928932 3 0.1830127 0.3943376 0.6056624 0.8169873 4 0.1666667 0.3333333 0.5000000 0.6666667 0.8333333 5 0.1545085 0.2927051 0.4309017 0.5690983 0.7072949 0.8454915

39 And the point is: depending on your goals this table might be the one you want. However, be warned the l2 minimax adding of sqrt(n) pseudo-observations is an uncommon procedure. You want to check if you really want that.

And that is it

What to take away

Deriving or justifying optimal inference techniques on even simple dice games can bring in a lot of heavy calculation. If you don’t find that worrying, then you aren’t paying attention.

For standard situations statisticians did the heavy calculations a long time ago and packaged up good and simple procedures (the justifications are difficult, but you don’t have to repeat the justifications each time you apply the methods).

Unbiasedness is just one desirable property among many. If you accept it as required, you are often forced to accept the traditional empirical frequentist estimates as the only possible and best possible estimates (not always a good thing).

Differences in Bayesian and frequentist assumptions lead not only to different hypothesis testing paradigms (confidence intervals versus credible intervals); they also pick different "optimal" estimates. The best answer depends on your use case (not your sense of style).

41

Thank you

42

Links

iPython notebook of most of these results/graphs: https://github.com/WinVector/Examples/blob/master/freq/python/freqMin.rst

More on this topic:
http://www.win-vector.com/blog/2014/07/frequenstist-inference-only-seems-easy/
http://www.win-vector.com/blog/2014/07/automatic-bias-correction-doesnt-fix-omitted-variable-bias/

For more information please try our blog: http://www.win-vector.com/blog/ and our book "Practical Data Science with R" http://practicaldatascience.com .

Please contact us with comments, questions, ideas, projects at: [email protected]

43 ipython notebook working through all these examples https://github.com/WinVector/Examples/blob/master/freq/python/freqMin.rst