Learning to Estimate Without Bias

Tzvi Diskin, Yonina C. Eldar and Ami Wiesel

Abstract—We consider the use of deep learning for parameter estimation. We propose Bias Constrained Estimators (BCE) that add a squared bias term to the standard mean squared error (MSE) loss. The main motivation for BCE is learning to estimate deterministic unknown parameters with no Bayesian prior. Unlike standard learning based estimators that are optimal on average, we prove that BCEs converge to Minimum Variance Unbiased Estimators (MVUEs). We derive closed form solutions to linear BCEs. These provide a flexible bridge between linear regression and the least squares method. In non-linear settings, we demonstrate that BCEs perform similarly to MVUEs even when the latter are computationally intractable. A second motivation for BCE is in applications where multiple estimates of the same unknown are averaged for improved performance. Examples include distributed sensor networks and data augmentation at test time. In such applications, unbiasedness is a necessary condition for asymptotic consistency.

I. INTRODUCTION

Parameter estimation is a fundamental problem in many areas of science and engineering. The goal is to recover an unknown deterministic parameter y given realizations of random variables x whose distribution depends on y. An estimator ŷ(x) is typically designed by minimizing some distance between the observed variables and their statistics, e.g., least squares, maximum likelihood estimation (MLE) or the Method of Moments (MoM) [Kay and Kay, 1993]. Their performance is often measured in terms of mean squared error (MSE) and bias, both of which depend on the unknown. MLE is asymptotically a Minimum Variance Unbiased Estimator (MVUE) for any value of y. In the last decade, many works have suggested designing estimators using deep learning. The latter provide a promising tradeoff between computational complexity and accuracy, and are useful when classical methods such as MLE are intractable. Learning based estimators directly minimize the average MSE with respect to a given dataset. Their performance may deteriorate for other values of y. To close this gap, we propose Bias Constrained Estimators (BCE), in which we add a squared bias term to the learning loss in order to ensure unbiasedness for any value of y.

The starting point of this paper is the definitions of MSE and bias, which differ between fields. Classical statistics defines the metrics as functions of the unknown y (see, for example, Eq. 2.25 in [Friedman, 2017], Chapter 2 in [Kay and Kay, 1993] or [Lehmann and Scheffe, 1948]):

BIAS_ŷ(y) = E[ŷ(x)] − y
VAR_ŷ(y) = E[ ‖ŷ(x) − E[ŷ(x)]‖² ]
MSE_ŷ(y) = E[ ‖ŷ(x) − y‖² ]                               (1)

The “M” in the MSE and all the expectations in (1) are with respect to the distribution of x parameterized by y. The unknowns are not integrated out and the metrics are functions of y. The goal of parameter estimation is to minimize the MSE for any value of y. This problem is ill-defined, as different estimators are better for different values of y. Therefore, most of this field focuses on minimizing the MSE only among unbiased estimators, which have zero bias for any value of y. An estimator that is optimal in this sense is called an MVUE. As explained above, the popular MLE is indeed asymptotically an MVUE.

Learning theory and Bayesian statistics also often use the acronym MSE with a slightly different definition (see, for example, Chapter 10 in [Kay and Kay, 1993] or Chapter 2.4 in [Friedman, 2017]). Bayesian methods model y as a random vector and use the expectation of MSE(y) with respect to its prior. To avoid confusion, we refer to this metric as the BMSE:

BMSE_ŷ = E[MSE_ŷ(y)]                                      (2)

Unlike the MSE, the BMSE is not a function of y. The BMSE has a well-defined minimum, but it is optimal only on average with respect to y and depends on the chosen prior.

In recent years, there has been a growing body of work on learning based approximations to the MLE when it is computationally intractable [Ongie et al., 2020], [Schlemper et al., 2018], [Dong et al., 2015], [Diskin et al., 2017], [Yang et al., 2019], [Naimipour et al., 2020], [Izacard et al., 2019], [Dreifuerst and Heath Jr, 2021]. This trend opens a window of opportunities for low cost and accurate estimators. Yet, there is a subtle issue with the MSE definitions. Deep learning based algorithms minimize the BMSE and are not optimal for specific values of y. Partial solutions to this issue include uninformative priors in special cases [Jaynes, 1968] and minmax approaches that are often too pessimistic [Eldar and Merhav, 2005]. To close this gap, BCE tries to minimize the BMSE while promoting unbiasedness. We prove that BCE converges to an MVUE. Namely, given a rich enough architecture and a sufficiently large number of samples, BCE is asymptotically unbiased and achieves the lowest possible MSE for any value of y. Numerical experiments clearly support the theory and show that BCE leads to near-MVUE estimators for all y. Estimators based on the BMSE alone are better on average but can be bad for specific values of y.

To gain more intuition into BCE, we consider the case of linear networks and linear generative models. In this setting, BCE has a closed form solution which summarizes the story nicely. On the classical side, the MVUE in linear models is the least squares estimator (which is also the Gaussian MLE). On the Bayesian machine learning side, the minimum BMSE estimator is known as linear regression. By penalizing the bias, BCE allows us to learn MVUE-like regressors. The formulas clearly illustrate how BCE reduces the dependency on the training prior. Indeed, we also show that BCE can be used as a regularizer against overfitting in inverse problems where the signal to noise ratio (SNR) is large but the number of available labels is limited.

A second motivation for BCE is in the context of averaging estimators at test time. In this setting, the goal is to learn a single network that will be applied to multiple inputs, whose average is then taken as the final output. This is the case, for example, in a sensor network where multiple independent measurements of the same phenomenon are available. Each sensor applies the network locally and sends its estimate to a fusion center. The fusion center then uses the average of the estimates as the final estimate [Li and AlRegib, 2009]. Averaging at test time has also become standard in image classification, where the same network is applied to multiple crops at inference time [Krizhevsky et al., 2012]. In such settings, unbiasedness of the local estimates is a goal on its own, as it is a necessary condition for asymptotic consistency of the global estimate. BCE enforces this condition and improves accuracy. We demonstrate this advantage in the context of test-time data augmentation on the CIFAR10 image classification dataset.

A. Related Work

Deep learning for estimation: Deep learning based estimators have been proposed for inverse problems [Ongie et al., 2020], robust regression [Diskin et al., 2017], SNR estimation [Yang et al., 2019], phase retrieval [Naimipour et al., 2020], frequency estimation [Izacard et al., 2019], [Dreifuerst and Heath Jr, 2021] and more. In contrast to standard data-driven machine learning, these networks are trained on synthetic data generated from a well specified probabilistic model. The unknown parameters do not follow any known prior, but a fictitious label prior p_fake(y) is devised in order to generate a training set. The resulting estimators are optimal on average with respect to the fake prior, but have no guarantee otherwise. BCE directly addresses these works and makes it possible to achieve near optimal performance among all unbiased estimators for arbitrary parameters (independently of the chosen fake prior).

Fairness: BCE is intimately related to the topics of “fairness” and “out of distribution (OOD) generalization” in machine learning. The topics are related both in terminology and in the solutions. Fair learning tries to eliminate biases in the training set and considers properties that need to be protected [Agarwal et al., 2019]. OOD works introduce an additional “environment” variable, and the goal is to train a model that will generalize well to new unseen environments [Creager et al., 2021], [Maity et al., 2020]. Among the proposed solutions are distributionally robust optimization [Bagnell, 2005], which is a type of minmax estimator, as well as invariant risk minimization [Arjovsky et al., 2019] and calibration constraints [Wald et al., 2021], both of which are reminiscent of BCE. A main difference is that in our work the protected properties are the labels themselves. Another core difference is that in parameter estimation we assume full knowledge of the generative model, whereas the above works are purely data-driven and discriminative.

Averaging in test-time: Averaging of multiple estimates at test time is used in different applications where centralized estimation is difficult or impossible. This includes distributed estimation due to large datasets [Zhang et al., 2013], aggregation of sequential measurements in object tracking [Ho, 2012], averaging scores over different crops and flips in image classification [Krizhevsky et al., 2012], [Simonyan and Zisserman, 2014] or detection confidences of the same object in different frames in video object detection [Han et al., 2016]. A key idea for improving the accuracy of the averaged estimate is bias reduction of the local estimators [Ho, 2012], [Luengo et al., 2015]. BCE makes it possible to use this concept in the framework of deep learning based estimators.

B. Notation

We use regular symbols for scalars, bold symbols for vectors and bold capital letters for matrices. Throughout the paper, we use conditional sampling: first the parameters y_i are chosen or randomly generated, and then measurements x_j(y_i) are generated using y_i. To simplify the expressions, we use the following notations for empirical means:

E_M[f(x, y) | y] = (1/M) Σ_{j=1}^{M} f(x_j(y), y)

E_N[f(y)] = (1/N) Σ_{i=1}^{N} f(y_i)

E_NM[f(x, y)] = (1/(MN)) Σ_{i=1}^{N} Σ_{j=1}^{M} f(x_j(y_i), y_i),                   (3)

where f is an arbitrary function.
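As a concrete illustration of the notation in (3), below is a minimal NumPy sketch, assuming the measurements are stored as an array x of shape (N, M, dx) with x[i, j] = x_j(y_i) and the labels as y of shape (N, dy); the function and array names are illustrative only.

```python
import numpy as np

def empirical_means(f, x, y):
    """Empirical means of (3) for a vector-valued f(x, y).

    x: array of shape (N, M, dx), where x[i, j] is the j-th draw under y[i].
    y: array of shape (N, dy).
    Returns E_M[f | y_i] for every i (shape (N, df)) and E_NM[f] (shape (df,)).
    """
    vals = np.array([[f(xij, yi) for xij in xi] for xi, yi in zip(x, y)])
    e_m = vals.mean(axis=1)    # inner average over the M measurements per label
    e_nm = e_m.mean(axis=0)    # outer average over the N labels
    return e_m, e_nm

# Example: f(x, y) = x - y gives the empirical bias of the identity estimator.
rng = np.random.default_rng(0)
y = rng.normal(size=(5, 2))
x = y[:, None, :] + rng.normal(size=(5, 100, 2))
e_m, e_nm = empirical_means(lambda xij, yi: xij - yi, x, y)
```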

II. BIAS CONSTRAINED ESTIMATION

Consider a random vector x whose probability distribution is parameterized by an unknown deterministic vector y in some region S ⊂ R^n. We are interested in the case in which y is a deterministic variable without any prior distribution. We assume that x depends on y, either via a likelihood function p(x; y) or a generative model x = G(y), where G is a (possibly stochastic) known transformation. Given this information, our goal is to estimate y given x.

The recent deep learning revolution has led many to apply machine learning methods for estimation [Dong et al., 2015], [Ongie et al., 2020], [Gabrielli et al., 2017], [Rudi et al., 2020], [Dua, 2011]. For this purpose, a representative dataset

D_N = {y_i, x_i}_{i=1}^{N}                                (4)

is collected. Then, a class of possible estimators H is chosen, and the EMMSE estimator is defined as the solution to:

min_{ŷ∈H} E_NM[ ‖ŷ(x) − y‖² ].                            (5)

The main contribution of this paper is an alternative BCE approach in which we minimize the empirical MSE along with an empirical squared bias regularization. For this purpose, we collect an enhanced dataset

D_NM = { y_i, {x_ij}_{j=1}^{M} }_{i=1}^{N},               (6)

where the x_ij are all associated with the same y_i. BCE is then defined as the solution to:

min_{ŷ∈H} E_NM[ ‖ŷ(x) − y‖² ] + λ E_N[ ‖E_M[ŷ(x) − y | y]‖² ].        (7)

The objective function of the above optimization problem will be referred to as the "BCE loss". In the next sections, we show that BCE leads to better approximations of the MVUE/MLE than EMMSE, and that it resembles the classical methods of parameter estimation. We also show that it is beneficial when multiple local estimators are averaged together. Finally, it can also be used as regularization in a Bayesian setting when the prior is not known exactly.
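For concreteness, the following PyTorch-style sketch evaluates the BCE loss of (7) on a mini-batch organized as in (3). The network net, the batch shapes and the penalty weight lam are assumptions made for the illustration, not the authors' implementation.

```python
import torch

def bce_loss(net, x, y, lam):
    """BCE loss (7): empirical MSE plus lam times the empirical squared bias.

    x: tensor of shape (N, M, dx), M measurements per parameter vector y_i.
    y: tensor of shape (N, dy).
    """
    N, M, dx = x.shape
    y_hat = net(x.reshape(N * M, dx)).reshape(N, M, -1)     # estimates for every x_ij
    err = y_hat - y.unsqueeze(1)                            # broadcast y_i over the M samples
    mse_term = (err ** 2).sum(dim=-1).mean()                # E_NM[ ||y_hat(x) - y||^2 ]
    bias_term = (err.mean(dim=1) ** 2).sum(dim=-1).mean()   # E_N[ ||E_M[y_hat(x) - y | y]||^2 ]
    return mse_term + lam * bias_term
```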

III. BCE FOR APPROXIMATING MVUE

The main motivation for BCE is as an approximation to the MVUE (or the asymptotic MLE) when the latter is difficult to compute. To understand this idea, we recall the basic definitions in classical (non-Bayesian) parameter estimation. Briefly, a core challenge in this field is that the unknowns are deterministic parameters without any prior distributions. The bias, variance and MSE all depend on the unknown parameter y and cannot be minimized directly for all parameter values simultaneously. A traditional workaround is to consider only unbiased estimators and seek an MVUE. It is well known that, asymptotically and under mild conditions, the MLE is an MVUE. In practice, computing the MLE often involves difficult or intractable optimization problems, and there is ongoing research on efficient approximations. In this section, we show that BCE allows us to learn estimators with fixed and prescribed computational complexity that are also asymptotically MVUE.

Classical parameter estimation is not data-driven but based on a generative model. To enjoy the benefits of data-driven learning, it is common to generate synthetic data using the model and train a network on this data [Gabrielli et al., 2017], [Rudi et al., 2020], [Dua, 2011]. Specifically, y is assumed random with some fictitious prior p_fake(y) such that p_fake(y) ≠ 0 for all y ∈ S. An artificial dataset D_N is synthetically generated according to p_fake(y) and the true p(x; y). Standard practice is to then learn the network using EMMSE in (5). This leads to an estimator which is optimal on average with respect to a fictitious prior. In contrast, we propose to generate synthetic data with multiple x_ij for the same y_i and use the BCE optimization in (7). The complete method is provided in Algorithm 1 below.

We note that the BCE objective penalizes the average squared bias across the training set. Intuitively, to achieve MVUE performance we need to penalize the bias for any possible value of y, which would be much more difficult both statistically and computationally (e.g., using a minmax approach). Nonetheless, our analysis below proves that simply penalizing the average is sufficient.

Algorithm 1 BCE with synthetic data

• Choose a fictitious prior p_fake(y).
• Generate N samples {y_i}_{i=1}^{N} ∼ p_fake(y).
• For each y_i, i = 1, ..., N: generate M samples {x_j(y_i)}_{j=1}^{M} ∼ p(x | y_i).
• Solve the BCE optimization in (7).
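A minimal sketch of Algorithm 1 that reuses the bce_loss sketch above; sample_prior and sample_x stand for user-supplied samplers of p_fake(y) and p(x | y), and the default hyperparameters are illustrative assumptions.

```python
import torch

def train_bce(net, sample_prior, sample_x, n_steps, N=10, M=100, lam=1000.0, lr=1e-3):
    """Algorithm 1: train on synthetic data with the BCE loss (7).

    sample_prior(N) -> (N, dy) parameters drawn from the fictitious prior p_fake(y).
    sample_x(y, M)  -> (N, M, dx) measurements drawn from p(x | y_i) for each row of y.
    """
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(n_steps):
        y = sample_prior(N)            # fresh synthetic labels at every step
        x = sample_x(y, M)             # M conditionally independent measurements per label
        loss = bce_loss(net, x, y, lam)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net
```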

To analyze BCE, we begin with two definitions.

Definition 1. An estimator ŷ(x) is called unbiased if it satisfies BIAS_ŷ(y) = 0 for all y ∈ S.

Definition 2. An MVUE is an unbiased estimator ŷ*(x) whose variance is lower than or equal to that of any other unbiased estimator for all values of y.

An MVUE exists in simple models, and the MLE is asymptotically near MVUE [Kay and Kay, 1993]. The following theorem suggests that deep learning based estimators with BCE loss functions are also near MVUE.

Theorem 1. Assume that
1) An MVUE denoted by ŷ_MVUE exists within H.
2) The fake prior is non-singular, i.e., p_fake(y) ≠ 0 for all y ∈ S.
3) The variance of the MVUE estimate under the fake prior is finite: ∫ VAR_ŷMVUE(y) p_fake(y) dy < ∞.
Then, BCE coincides with the MVUE for sufficiently large λ, M and N.

Before proving the theorem, we note that the first assumption can be met by choosing a sufficiently expressive and universal hypothesis class. The second and third assumptions are technical and less important in practice.

Proof. The main idea of the proof is that, because the squared bias is non-negative, it is equal to zero for all y if and only if its expectation over any non-singular prior is zero. Thus, taking λ to infinity enforces a solution that is unbiased for any value of y. Among the unbiased solutions, only the MSE term (which now equals the variance) remains in the BCE loss, and thus the solution is the MVUE.

For sufficiently large M and N, (7) converges to its population form:

min_{ŷ∈H} E[ ‖ŷ(x) − y‖² ] + λ E[ ‖E[ŷ(x) − y | y]‖² ]                 (8)

Suppose that ŷ*(x) is the MVUE. Then

E[ŷ*(x) − y | y] = 0,   ∀ y ∈ S
E[ ‖E[ŷ*(x) − y | y]‖² ] = 0.

First, assume that the BCE solution ŷ_1(x) is biased for some y ∈ S:

E[ŷ_1(x) − y | y] ≠ 0   for some y ∈ S.

Thus

E[ ‖E[ŷ_1(x) − y | y]‖² ] > 0.

Since E[ ‖ŷ*(x) − y‖² ] is finite, we get a contradiction

L_BCE(ŷ*) < L_BCE(ŷ_1)

for sufficiently large λ, where L_BCE is the BCE loss. Second, assume that the BCE solution ŷ_2 is an unbiased estimator. By the definition of the MVUE,

E[ ‖ŷ*(x) − E[ŷ* | y]‖² | y ] ≤ E[ ‖ŷ_2(x) − E[ŷ_2 | y]‖² | y ]   ∀ y ∈ S

and

E[ ‖ŷ*(x) − y‖² | y ] ≤ E[ ‖ŷ_2(x) − y‖² | y ]   ∀ y ∈ S.

Thus

L_BCE(ŷ*) = E[ ‖ŷ*(x) − y‖² ] ≤ E[ ‖ŷ_2(x) − y‖² ] = L_BCE(ŷ_2).


Together, the BCE loss of the MVUE is no larger than that of any other estimator, whether biased or unbiased, which completes the proof.

To summarize, EMMSE is optimal on average with respect to its training set; otherwise, it can be arbitrarily bad (see the experiments in Section VI). On the other hand, BCE is optimal for any value of y among the unbiased estimators, and thus in problems where an efficient estimator exists it achieves the Cramer Rao Bound [Kay and Kay, 1993], making its performance predictable for any value of y.

IV. LINEAR BCE

In this section, we focus on linear BCE (LBCE). This results in a closed form solution which is computationally efficient and interesting to analyze. LBCE is defined by the class of linear estimators:

H_linear = {ŷ : ŷ = Ax}.                                  (9)

This class is easy to fit, store and implement. In fact, it has a simple closed form solution.

Theorem 2. The LBCE is given by ŷ = Ax where

A_LBCE(λ) = E_NM[y xᵀ] ( (1/(λ+1)) E_NM[x xᵀ] + (1 − 1/(λ+1)) E_N[ E_M[x | y] E_M[xᵀ | y] ] )⁻¹        (10)

assuming the inverse exists. LBCE(0) coincides with the classical linear MMSE estimator, where the true expectations are replaced by their empirical counterparts:

A_LBCE(0) = E_NM[y xᵀ] E_NM[x xᵀ]⁻¹.                      (11)

Note that the second term in the inverse is easy to compute by sampling from p(x | y).

Proof. The proof is based on computing the gradient of the BCE objective function and setting it to zero. The full proof is in the Supplementary Material.
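For illustration, a small NumPy sketch of the closed form (10), assuming the enhanced dataset is stored as arrays y of shape (N, dy) and x of shape (N, M, dx); it is a direct transcription of the formula, not the authors' code.

```python
import numpy as np

def lbce_matrix(x, y, lam):
    """Closed-form linear BCE coefficient matrix A of (10).

    x: (N, M, dx) with x[i, j] ~ p(x | y_i);  y: (N, dy).
    """
    N, M, dx = x.shape
    xf = x.reshape(N * M, dx)
    yf = np.repeat(y, M, axis=0)              # y_i repeated once per x_ij
    Ryx = yf.T @ xf / (N * M)                 # E_NM[y x^T]
    Rxx = xf.T @ xf / (N * M)                 # E_NM[x x^T]
    mx = x.mean(axis=1)                       # E_M[x | y_i] for every i
    Rmm = mx.T @ mx / N                       # E_N[ E_M[x|y] E_M[x^T|y] ]
    w = 1.0 / (lam + 1.0)
    return Ryx @ np.linalg.inv(w * Rxx + (1.0 - w) * Rmm)
```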

Theorem 2 is applicable to any likelihood model, yet it is interesting to consider the classical case of linear models [Kay and Kay, 1993]:

x = Hy + n                                                (12)

where n is a noise vector such that E[n] = 0. More precisely, it is sufficient to consider the case where the conditional mean of x is linear in y. Here, we focus on an “inverse problem” where we have access to a dataset of N realizations of y and a stochastic generator of p(x; y). By choosing a large enough M, the conditional means converge to their population values and E_M[x | y] → Hy.

Theorem 3. Consider observations with a linear conditional mean

E[x | y] = Hy.                                            (13)

For M → ∞, LBCE reduces to

A_LBCE(λ) = Σ_y Hᵀ ( H Σ_y Hᵀ + (1/(λ+1)) Σ_x )⁻¹
          = ( Hᵀ Σ_x⁻¹ H + (1/(λ+1)) Σ_y⁻¹ )⁻¹ Hᵀ Σ_x⁻¹              (14)

where

Σ_y = E_N[y yᵀ]
Σ_x = E_N[Σ_x(y)]
Σ_x(y) = E[ (x − E[x | y])(x − E[x | y])ᵀ | y ]                       (15)

and we assume all of these are invertible. Taking λ → ∞ yields the seminal Weighted Least Squares (WLS) estimator¹

A_LBCE(∞) = ( Hᵀ Σ_x⁻¹ H )⁻¹ Hᵀ Σ_x⁻¹.                    (16)

¹In fact, (16) is slightly more general than WLS, which assumes that Σ_x(y) does not depend on y.

Proof. The proof is based on plugging (12) into (10) and using the matrix inversion lemma. The full proof is in the Supplementary Material.
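As a quick numerical sanity check of the two forms in (14) and the WLS limit (16), the following sketch uses an arbitrary synthetic H, Σ_y and Σ_x (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 4, 6, 5.0
H = rng.normal(size=(n, d))
By, Bx = rng.normal(size=(d, d)), rng.normal(size=(n, n))
Sy = By @ By.T + np.eye(d)        # positive definite Sigma_y
Sx = Bx @ Bx.T + np.eye(n)        # positive definite Sigma_x
c = 1.0 / (lam + 1.0)
Sx_inv, Sy_inv = np.linalg.inv(Sx), np.linalg.inv(Sy)

A1 = Sy @ H.T @ np.linalg.inv(H @ Sy @ H.T + c * Sx)               # first form of (14)
A2 = np.linalg.inv(H.T @ Sx_inv @ H + c * Sy_inv) @ H.T @ Sx_inv   # second form of (14)
A_wls = np.linalg.inv(H.T @ Sx_inv @ H) @ H.T @ Sx_inv             # WLS estimator (16)

assert np.allclose(A1, A2)        # the matrix inversion lemma step
print(np.abs(A2 - A_wls).max())   # shrinks as lam grows; lambda -> infinity recovers WLS
```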

LBCE in (14) clearly shows that increasing λ reduces the dependence on the specific values y_i used in training. In some sense, BCE acts as a regularizer, which is useful in finite sample settings. To gain more intuition on this effect, consider the following toy problem:

x = y + w,   w ∼ N(0, 1).                                 (17)

LBCE reduces to

a_LBCE(λ) = ȳ² / ( ȳ² + 1/(λ+1) )                         (18)

where ȳ² = (1/N) Σ_{i=1}^{N} y_i². Assume that an estimator is learned using N random samples from the prior y ∼ N(0, ρ), where ρ is an SNR parameter. Let BMSE_N denote the Bayesian MSE defined in (2), where the expectation is also taken with respect to the random sampling in training (details and proof in the Supplementary Material). Then, we have the following result.

Theorem 4. For a large enough ρ, there exists λ* > 0 such that LBCE(λ*) yields a smaller BMSE_N than that of LBCE(λ = 0).

Intuitively, the prior is less important in high SNR, and it then makes sense to use BCE and be less sensitive to it. On the other hand, in low SNR the prior is important and remains useful even when its empirical estimate is inaccurate.

For completeness, it is interesting to compare BCE to classical regularizers. The most common approach relies on an L2 penalty, also known as ridge regression [Shalev-Shwartz and Ben-David, 2014]. Surprisingly, in the above inverse problem setting with finite N and M → ∞, ridge regression does not help and even increases the BMSE. Ridge regression involves the following optimization problem:

min_a Σ_{i=1}^{N} E[ (a x_i − y_i)² | y_i ] + λ a²                    (19)

and its solution is

a_Ridge(λ) = ȳ² / ( ȳ² + 1 + ρλ ).                        (20)

Comparing (18) and (20), it is easy to see that ridge and BCE have opposite effects: ridge increases the denominator whereas BCE decreases it. The estimators are identical, i.e., a_Ridge(λ_Ridge) = a_LBCE(λ_LBCE), if

λ_Ridge = (1/ρ) ( 1/(λ_LBCE + 1) − 1 ).                   (21)

In high SNR, we already showed that BCE is beneficial; to get this effect with ridge, one must resort to a negative regularization weight, which is highly unstable. Interestingly, this is reminiscent of known results in linear models with uncertainty, where Total Least Squares (TLS) coincides with a negative ridge regularization [Wiesel et al., 2008].
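A tiny numeric check of the scalar formulas (18), (20) and the mapping (21), with purely illustrative values of ȳ², ρ and λ; note that the matching ridge weight comes out negative, as discussed above.

```python
import numpy as np

ybar2, rho, lam_bce = 2.3, 10.0, 4.0                      # illustrative values only
a_bce = ybar2 / (ybar2 + 1.0 / (lam_bce + 1.0))           # LBCE coefficient (18)
lam_ridge = (1.0 / rho) * (1.0 / (lam_bce + 1.0) - 1.0)   # matching ridge weight (21)
a_ridge = ybar2 / (ybar2 + 1.0 + rho * lam_ridge)         # ridge coefficient (20)
assert np.isclose(a_bce, a_ridge) and lam_ridge < 0       # identical estimators, negative ridge
```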

V. BCE FOR AVERAGING

In this section, we consider a different motivation for BCE, in which unbiasedness is a goal on its own. Specifically, BCEs are advantageous in scenarios where multiple estimators are combined together for improved performance.

Typical examples include sensor networks where each sensor provides a local estimate of the unknown, or computer vision applications where the same network is applied to different crops of a given image to improve accuracy [Krizhevsky et al., 2012], [Szegedy et al., 2015].

The underlying assumption in BCE for averaging is that we have access, both at training and at test time, to multiple x_ij associated with the same y_i. This assumption holds in learning with synthetic data (e.g., Algorithm 1), or with real world data with multiple views or augmentations. In any case, the data structure allows us to learn a single network ŷ(·) which is applied to each of the inputs, and their average is then taken as the output, as summarized in Algorithm 2. The following theorem then proves that BCE results in asymptotic consistency.

Algorithm 2 BCE for averaging

• Let ŷ(·) be the solution to BCE in (7).
• Define the global estimator as

ȳ(x) = (1/M_t) Σ_{j=1}^{M_t} ŷ(x_j),                      (22)

where M_t is the number of local estimators at test time.
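A minimal sketch of Algorithm 2, assuming a trained network net and a batch x_views of M_t i.i.d. measurements of the same unknown:

```python
import torch

def averaged_estimate(net, x_views):
    """Algorithm 2: average the local estimates over the Mt test-time views.

    x_views: tensor of shape (Mt, dx) holding i.i.d. measurements of the same y.
    """
    with torch.no_grad():
        return net(x_views).mean(dim=0)    # (1/Mt) * sum_j y_hat(x_j), as in (22)
```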

Theorem 5. Consider the case in which M_t independent and identically distributed (i.i.d.) measurements x_j of the same y are available. Then BCE with sufficiently large λ, M and N allows consistent estimation as M_t increases, provided that an unbiased estimator ŷ*(x_j) with finite variance exists within the hypothesis class.

Proof. Following the proof of Theorem 1, BCE with sufficiently large λ, M and N results in an unbiased estimator, if one exists within the hypothesis class. The global metrics satisfy:

BIAS_ȳ(y) = (1/M_t) Σ_{j=1}^{M_t} BIAS_ŷ(y) = BIAS_ŷ(y)

VAR_ȳ(y) = (1/M_t²) Σ_{j=1}^{M_t} VAR_ŷ(y) = (1/M_t) VAR_ŷ(y) → 0 as M_t → ∞.        (23)

Thus, the global variance decreases with M_t, whereas the global bias remains constant, and for an unbiased local estimator it is equal to zero.


Fig. 1: Bias of SNR, MSE of SNR and MSE of inverse SNR.

VI. EXPERIMENTS

In this section, we present numerical experimental results. We focus on the main ideas and conclusions; all the implementation details are available in the Supplementary Material.

A. SNR Estimation

Our first experiment addresses a non-convex estimation problem with a single unknown. The unknown is scalar and therefore we can easily compute the MLE and visualize its performance. Specifically, we consider non-data-aided SNR estimation [Alagha, 2001]. The received signal is

x_i = a_i h + w_i                                         (24)

where a_i = ±1 are equiprobable binary symbols, h is an unknown signal, and w_i is white Gaussian noise with unknown variance denoted by σ², for i = 1, ..., n. The goal is to estimate the SNR defined as y = h²/σ². Different estimators have been proposed for this problem, including the MLE [Wiesel et al., 2002] and the method of moments [Pauluzzi and Beaulieu, 2000]. For our purposes, we train a network based on synthetic data using EMMSE and BCE.
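For reference, a sketch of a synthetic generator for the model (24); the sampling ranges follow the description in the Supplementary Material, while the record length n and the function name are assumptions made for the example.

```python
import numpy as np

def generate_snr_batch(rng, N=10, M=100, n=64, h_range=(1.0, 10.0), snr_range=(2.0, 50.0)):
    """Draw (x, y) pairs for the non-data-aided SNR model (24): x_i = a_i*h + w_i."""
    h = rng.uniform(*h_range, size=N)                       # unknown signal level
    snr = rng.uniform(*snr_range, size=N)                   # y = h^2 / sigma^2
    sigma = h / np.sqrt(snr)                                # noise std implied by the SNR
    a = rng.choice([-1.0, 1.0], size=(N, M, n))             # equiprobable binary symbols
    w = sigma[:, None, None] * rng.normal(size=(N, M, n))   # white Gaussian noise
    x = h[:, None, None] * a + w                            # M length-n records per SNR value
    return x, snr
```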

Figure 1 compares the MSE and the bias of the different estimators as a function of the SNR. It is evident that BCE is a better approximation of the MLE than EMMSE. EMMSE is very biased towards a narrow regime of the SNR. This is because the MSE scales as the square of the SNR and the average MSE loss is dominated by the large MSE examples. For completeness, we also plot the MSE in terms of inverse SNR:

E[ (1/ŷ − 1/y)² ]                                         (25)

Functional invariance is a well known and attractive property of the MLE. The figure shows that both the MLE and BCE are robust to the inverse transformation, whereas EMMSE is unstable and performs poorly in low SNR.

B. Structured Covariance Estimation

Our second experiment considers a more interesting, high dimensional structured covariance estimation problem. Specifically, we consider the estimation of a sparse covariance matrix [Chaudhuri et al., 2007]. The measurement model is

p(x; Σ) ∼ N(0, Σ)

Σ = [ 1+y1   0      0      y6/2   0
      0      1+y2   0      y7/2   0
      0      0      1+y3   0      y8/2
      y6/2   y7/2   0      1+y4   y9/2
      0      0      y8/2   y9/2   1+y5 ]                  (26)

where 0 ≤ y_i ≤ 1 are unknown parameters. We train a neural network with p_fake(y_i) ∼ U(0, 1) using both EMMSE and BCE. Computing the MLE is non-trivial in this setting, and we therefore compare the performance to the theoretical asymptotic variance defined by the Cramer Rao Bound (CRB) [Kay and Kay, 1993]:

VAR_ŷ(y) ≥ Tr(F⁻¹(y))                                     (27)

where F(y) is the Fisher Information Matrix (FIM)

[F(y)]_ij = ½ Tr[ Σ⁻¹(y) (∂Σ(y)/∂y_i) Σ⁻¹(y) (∂Σ(y)/∂y_j) ].          (28)
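For illustration, the following sketch builds Σ(y) of (26) and evaluates the per-sample FIM (28) with numerical derivatives; it is a standalone check under the stated parameterization, not the training code.

```python
import numpy as np

def sigma_of_y(y):
    """Sparse 5x5 covariance of (26), parameterized by y in [0, 1]^9."""
    S = np.diag(1.0 + y[:5])
    for (i, j), k in [((0, 3), 5), ((1, 3), 6), ((2, 4), 7), ((3, 4), 8)]:
        S[i, j] = S[j, i] = 0.5 * y[k]       # off-diagonal entries 0.5*y6, ..., 0.5*y9
    return S

def fim(y, eps=1e-6):
    """Fisher information matrix of (28) via central finite differences."""
    S_inv = np.linalg.inv(sigma_of_y(y))
    def dS(k):
        e = np.zeros_like(y); e[k] = eps
        return (sigma_of_y(y + e) - sigma_of_y(y - e)) / (2 * eps)
    F = np.zeros((9, 9))
    for i in range(9):
        for j in range(9):
            F[i, j] = 0.5 * np.trace(S_inv @ dS(i) @ S_inv @ dS(j))
    return F

y = np.random.default_rng(0).uniform(0.1, 0.9, size=9)
crb = np.trace(np.linalg.inv(fim(y)))        # scalar CRB of (27) for this y
```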


The CRB depends on the specific values of y. Therefore, we take random realizations and provide scatter plots ordered according to the CRB value.

Figure 2 presents the results of two simulations. In the first, we generate data according to the training distribution y ∼ U(0, 1). As expected, EMMSE, which was optimized for this distribution, provides the best MSEs. In the second, we generate data according to a different distribution, y ∼ U(0, 0.4). Here the performance of EMMSE significantly deteriorates. In both cases, BCE is near MVUE and provides MSEs close to the CRB while ignoring the prior distribution used in training.

Fig. 2: Scatter plot of MSEs ordered by CRB values. In (a) the tested y’s are generated from the training distribution. In (b) the test distribution is different. BCE is near-MVUE and close to the CRB for both distributions, whereas EMMSE is better in (a) and weaker in (b).

C. BCE as Regularization

Our third experiment considers the use of LBCE as a regularizer in a linear model with additive Gaussian noise. Theorem 4 analyzed the scalar case, and here we address the high dimensional case using numerical simulations. The underlying model is

x = Hy + w
w ∼ N(0, Σ_w)
y ∼ N(0, Σ_y)                                             (29)

Dimensions are x ∈ R^n with n = 20 and y ∈ R^d with d = 20, and Σ_y is rank deficient. We assume a limited number of training samples {y_i}_{i=1}^{N}, but full knowledge of the measurement and noise models, which allows M → ∞. We train three linear regressors using the EMMSE loss, the EMMSE plus ridge loss, and BCE. Optimal regularization hyperparameters are chosen using a large validation set.

Figure 3 shows the resulting MSEs as a function of N. As predicted by the theory, BCE significantly improves the results when N is small. The optimal ridge parameter is negative and provides a negligible gain.

Fig. 3: MSE of BCE, EMMSE and Ridge as a function of the number of samples. BCE outperforms the rest when N is small.

D. Image Classification with Soft Labels

Our fourth experiment considers BCE for averaging in the context of image classification. We consider the popular CIFAR10 dataset. This paper focuses on regression rather than classification, and we therefore consider soft labels as proposed in the knowledge distillation technique [Hinton et al., 2015]. The soft labels are obtained using a strong “teacher” network from [Yu et al., 2018]. To exploit the benefits of averaging, we rely on data augmentation in the training and test phases [Krizhevsky et al., 2012]. For augmentation, we use random cropping and flipping. We train two small “student” convolutional networks with identical architectures, one using the EMMSE loss and one using the BCE loss. More precisely, following other distillation works, we also add a hard-label cross entropy term to the losses.

Figure 4 compares the accuracy of EMMSE vs BCE as a function of the number of test-time data augmentation crops. It can be seen that while on a single crop EMMSE achieves slightly better accuracy, BCE achieves better results when averaged over many crops in the test phase.

Fig. 4: BCE for averaging: accuracy as a function of the number of test-time augmentation crops.

VII. SUMMARY

In recent years, deep neural networks (DNNs) have been replacing classical estimation algorithms in many fields. While DNNs give a remarkable improvement in performance "on average", in some situations one would prefer classical frequentist algorithms that come with guarantees on their performance for any value of the unknown parameters. In this work we show that when an efficient estimator exists, deep neural networks can be used to learn it using a bias constrained loss, provided that the architecture is expressive enough. Future work includes extensions to objective functions other than the MSE and applying the concept of BCE to real world problems.

VIII. ACKNOWLEDGMENT

This research was partially supported by ISF grant 2672/21.

REFERENCES

[Agarwal et al., 2019] Agarwal, A., Dudik, M., and Wu, Z. S. (2019). Fair regression: Quantitative definitions and reduction-based algorithms. In International Conference on Machine Learning, pages 120–129. PMLR.

[Alagha, 2001] Alagha, N. S. (2001). Cramer-Rao bounds of SNR estimates for BPSK and QPSK modulated signals. IEEE Communications Letters, 5(1):10–12.

[Arjovsky et al., 2019] Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893.

[Bagnell, 2005] Bagnell, J. A. (2005). Robust supervised learning. In AAAI, pages 714–719.

[Chaudhuri et al., 2007] Chaudhuri, S., Drton, M., and Richardson, T. S. (2007). Estimation of a covariance matrix with zeros. Biometrika, 94(1):199–216.

[Creager et al., 2021] Creager, E., Jacobsen, J.-H., and Zemel, R. (2021). Environment inference for invariant learning. In International Conference on Machine Learning, pages 2189–2200. PMLR.

[Diskin et al., 2017] Diskin, T., Draskovic, G., Pascal, F., and Wiesel, A. (2017). Deep robust regression. In 2017 IEEE 7th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 1–5. IEEE.

[Dong et al., 2015] Dong, C., Loy, C. C., He, K., and Tang, X. (2015). Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307.

[Dreifuerst and Heath Jr, 2021] Dreifuerst, R. and Heath Jr, R. W. (2021). SignalNet: A low resolution sinusoid decomposition and estimation network. arXiv preprint arXiv:2106.05490.

[Dua, 2011] Dua, V. (2011). An artificial neural network approximation based decomposition approach for parameter estimation of system of ordinary differential equations. Computers & Chemical Engineering, 35(3):545–553.

[Eldar and Merhav, 2005] Eldar, Y. C. and Merhav, N. (2005). Minimax MSE-ratio estimation with signal covariance uncertainties. IEEE Transactions on Signal Processing, 53(4):1335–1347.

[Friedman, 2017] Friedman, J. H. (2017). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Open.

[Gabrielli et al., 2017] Gabrielli, L., Tomassetti, S., Squartini, S., and Zinato, C. (2017). Introducing deep machine learning for parameter estimation in physical modelling. In Proceedings of the 20th International Conference on Digital Audio Effects.

[Han et al., 2016] Han, W., Khorrami, P., Paine, T. L., Ramachandran, P., Babaeizadeh, M., Shi, H., Li, J., Yan, S., and Huang, T. S. (2016). Seq-NMS for video object detection. arXiv preprint arXiv:1602.08465.

[Hinton et al., 2015] Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.


[Ho, 2012] Ho, K. (2012). Bias reduction for an explicit solution of source localization using TDOA. IEEE Transactions on Signal Processing, 60(5):2101–2114.

[Izacard et al., 2019] Izacard, G., Mohan, S., and Fernandez-Granda, C. (2019). Data-driven estimation of sinusoid frequencies. arXiv preprint arXiv:1906.00823.

[Jaynes, 1968] Jaynes, E. T. (1968). Prior probabilities. IEEE Transactions on Systems Science and Cybernetics, 4(3):227–241.

[Kay and Kay, 1993] Kay, S. M. and Kay, S. M. (1993). Fundamentals of Statistical Signal Processing: Estimation Theory, volume 1. Prentice-Hall, Englewood Cliffs, NJ.

[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105.

[Lehmann and Scheffe, 1948] Lehmann, E. and Scheffe, H. (1948). Completeness, similar regions, and unbiased estimation. In Bulletin of the American Mathematical Society, volume 54, pages 1080–1080.

[Li and AlRegib, 2009] Li, J. and AlRegib, G. (2009). Distributed estimation in energy-constrained wireless sensor networks. IEEE Transactions on Signal Processing, 57(10):3746–3758.

[Luengo et al., 2015] Luengo, D., Martino, L., Elvira, V., and Bugallo, M. (2015). Bias correction for distributed Bayesian estimators. In 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 253–256. IEEE.

[Maity et al., 2020] Maity, S., Mukherjee, D., Yurochkin, M., and Sun, Y. (2020). There is no trade-off: enforcing fairness can improve accuracy. arXiv preprint arXiv:2011.03173.

[Naimipour et al., 2020] Naimipour, N., Khobahi, S., and Soltanalian, M. (2020). Unfolded algorithms for deep phase retrieval. arXiv preprint arXiv:2012.11102.

[Ongie et al., 2020] Ongie, G., Jalal, A., Metzler, C. A., Baraniuk, R. G., Dimakis, A. G., and Willett, R. (2020). Deep learning techniques for inverse problems in imaging. IEEE Journal on Selected Areas in Information Theory, 1(1):39–56.

[Pauluzzi and Beaulieu, 2000] Pauluzzi, D. R. and Beaulieu, N. C. (2000). A comparison of SNR estimation techniques for the AWGN channel. IEEE Transactions on Communications, 48(10):1681–1691.

[Rudi et al., 2020] Rudi, J., Bessac, J., and Lenzi, A. (2020). Parameter estimation with dense and convolutional neural networks applied to the FitzHugh-Nagumo ODE. arXiv preprint arXiv:2012.06691.

[Schlemper et al., 2018] Schlemper, J., Yang, G., Ferreira, P., Scott, A., McGill, L.-A., Khalique, Z., Gorodezky, M., Roehl, M., Keegan, J., Pennell, D., et al. (2018). Stochastic deep compressive sensing for the reconstruction of diffusion tensor cardiac MRI. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 295–303. Springer.

[Shalev-Shwartz and Ben-David, 2014] Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

[Simonyan and Zisserman, 2014] Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[Szegedy et al., 2015] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9.

[Wald et al., 2021] Wald, Y., Feder, A., Greenfeld, D., and Shalit, U. (2021). On calibration and out-of-domain generalization. arXiv preprint arXiv:2102.10395.

[Wiesel et al., 2008] Wiesel, A., Eldar, Y. C., and Yeredor, A. (2008). Linear regression with Gaussian model uncertainty: Algorithms and bounds. IEEE Transactions on Signal Processing, 56(6):2194–2205.

[Wiesel et al., 2002] Wiesel, A., Goldberg, J., and Messer, H. (2002). Non-data-aided signal-to-noise-ratio estimation. In 2002 IEEE International Conference on Communications. Conference Proceedings. ICC 2002 (Cat. No. 02CH37333), volume 1, pages 197–201. IEEE.

[Yang et al., 2019] Yang, K., Huang, Z., Wang, X., and Wang, F. (2019). An SNR estimation technique based on deep learning. Electronics, 8(10):1139.

[Yu et al., 2018] Yu, F., Wang, D., Shelhamer, E., and Darrell, T. (2018). Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2403–2412.

[Zhang et al., 2013] Zhang, Y., Duchi, J. C., and Wainwright, M. J. (2013). Communication-efficient algorithms for statistical optimization. The Journal of Machine Learning Research, 14(1):3321–3363.


IX. APPENDIX

X. PROOFS OF THEOREMS

A. Proof of Theorem 2

We insert ŷ = Ax in (7) and get:

BCE = E_N[ E_M[ ‖Ax − y‖² | y ] + λ ‖E_M[Ax − y | y]‖² ]
    = E_N[ E_M[ xᵀAᵀAx − xᵀAᵀy − yᵀAx + yᵀy | y ] ]
      + λ E_N[ E_M[xᵀAᵀ | y] E_M[Ax | y] − E_M[xᵀAᵀ | y] y − yᵀ E_M[Ax | y] + yᵀy ]
    = E_N[ E_M[ xᵀAᵀAx | y ] ] + λ E_N[ E_M[xᵀAᵀ | y] E_M[Ax | y] ]
      − (λ+1) E_N[ E_M[xᵀAᵀ | y] y + yᵀ E_M[Ax | y] − yᵀy ]
    = Tr( A E_NM[x xᵀ] Aᵀ ) + λ E_N[ Tr( A E_M[x | y] E_M[xᵀ | y] Aᵀ ) ]
      − (λ+1) E_N[ Tr( A E_M[x | y] yᵀ ) + Tr( y E_M[xᵀ | y] Aᵀ ) − yᵀy ].

Now we take the derivative with respect to A and set it to zero:

0 = ∂BCE/∂A = 2( E_NM[x xᵀ] + λ E_N[ E_M[x | y] E_M[xᵀ | y] ] ) Aᵀ − 2(λ+1) E_N[ E_M[x | y] yᵀ ].

Using E_N[ E_M[x | y] yᵀ ] = E_NM[x yᵀ], this yields

A = (λ+1) E_NM[y xᵀ] ( E_NM[x xᵀ] + λ E_N[ E_M[x | y] E_M[xᵀ | y] ] )⁻¹
  = E_NM[y xᵀ] ( (1/(λ+1)) E_NM[x xᵀ] + (1 − 1/(λ+1)) E_N[ E_M[x | y] E_M[xᵀ | y] ] )⁻¹,

completing the proof.

B. Proof of Theorem 3

We assume that Σ_y is positive definite, so its square root Σ_y^{1/2} is well defined and also positive definite. In addition, denote n = x − E[x | y] = x − Hy, so that E[n nᵀ | y] = Σ_x(y) and E[n | y] = 0. We insert x = Hy + n in (10) and, letting M → ∞, get:

A = E_NM[ y(yᵀHᵀ + nᵀ) ] ( (1/(λ+1)) E_NM[ (Hy + n)(yᵀHᵀ + nᵀ) ]
      + (1 − 1/(λ+1)) E_N[ E[Hy + n | y] E[yᵀHᵀ + nᵀ | y] ] )⁻¹
  = E_N[ y yᵀ Hᵀ + y E[nᵀ | y] ] ( (1/(λ+1)) E_N[ H y yᵀ Hᵀ + H y E[nᵀ | y] + E[n | y] yᵀ Hᵀ + E[n nᵀ | y] ]
      + (1 − 1/(λ+1)) E_N[ H y yᵀ Hᵀ + H y E[nᵀ | y] + E[n | y] yᵀ Hᵀ + E[n | y] E[nᵀ | y] ] )⁻¹
  = Σ_y Hᵀ ( H Σ_y Hᵀ + (1/(λ+1)) Σ_x )⁻¹.

Applying the matrix inversion lemma with the square root Σ_y^{1/2},

Σ_y Hᵀ ( H Σ_y Hᵀ + (1/(λ+1)) Σ_x )⁻¹
  = Σ_y^{1/2} · Σ_y^{1/2} Hᵀ ( H Σ_y^{1/2} Σ_y^{1/2} Hᵀ + (1/(λ+1)) Σ_x )⁻¹
  = (λ+1) Σ_y^{1/2} ( (λ+1) Σ_y^{1/2} Hᵀ Σ_x⁻¹ H Σ_y^{1/2} + I )⁻¹ Σ_y^{1/2} Hᵀ Σ_x⁻¹
  = Σ_y^{1/2} ( Σ_y^{1/2} ( Hᵀ Σ_x⁻¹ H + (1/(λ+1)) Σ_y⁻¹ ) Σ_y^{1/2} )⁻¹ Σ_y^{1/2} Hᵀ Σ_x⁻¹
  = ( Hᵀ Σ_x⁻¹ H + (1/(λ+1)) Σ_y⁻¹ )⁻¹ Hᵀ Σ_x⁻¹,

completing the proof.

C. Proof of Theorem 4

In this case, (14) of the main paper becomes:

a_LBCE(λ) = ( 1 + 1 / ( (λ+1) (1/N) Σ_{i=1}^{N} y_i² ) )⁻¹ = z_N / ( z_N + α/(λ+1) )        (30)

where (1/N) Σ_{i=1}^{N} y_i² = ρ z_N and α = 1/ρ. The estimator and its error depend on the training set only through z_N:

BMSE = E[ (a_LBCE(λ) x − y)² | z_N ] = ( α + z_N² (λ+1)² ) / ( (λ+1) z_N + α )²             (31)

The proof continues by showing that, for small enough α, the derivative with respect to λ of the expectation of (31) over z_N is negative at λ = 0, and hence there exists some λ* such that

E[ (a_LBCE(λ*) x − y)² ] < E[ (a_LBCE(0) x − y)² ].                                         (32)

The derivative of (31) with respect to λ is:

∂BMSE/∂λ = 2α z_N ( (λ+1) z_N − 1 ) / ( (λ+1) z_N + α )³                                    (33)

In particular, in high SNR we have

∂BMSE/∂λ |_{λ=0} = 2α z_N (z_N − 1) / (z_N + α)³  ∝  (z_N − 1)/z_N²   as α → 0.             (34)

Therefore, taking the expectation over the training set:

∂BMSE_N/∂λ |_{λ=0} = E[ ∂BMSE/∂λ |_{λ=0} ] ∼ E[ (z_N − 1)/z_N² ] = ∫_0^∞ ((z_N − 1)/z_N²) P(z_N) dz_N        (35)

Using

E[z_N] = (1/N) Σ_{i=1}^{N} E[y_i²]/ρ = 1                                                     (36)

yields

∫_0^∞ (z_N − 1) P(z_N) dz_N = ∫_0^1 (z_N − 1) P(z_N) dz_N + ∫_1^∞ (z_N − 1) P(z_N) dz_N = 0   (37)

and thus

∫_0^1 ((z_N − 1)/z_N²) P(z_N) dz_N ≤ ∫_0^1 (z_N − 1) P(z_N) dz_N = −∫_1^∞ (z_N − 1) P(z_N) dz_N ≤ −∫_1^∞ ((z_N − 1)/z_N²) P(z_N) dz_N   (38)

and

∂BMSE_N/∂λ |_{λ=0} = E[ ∂BMSE/∂λ |_{λ=0} ] ∼ ∫_0^∞ ((z_N − 1)/z_N²) P(z_N) dz_N ≤ 0          (39)

for any distribution P(z_N) of a positive random variable with mean 1, with equality only if z_N is deterministic and equal to 1, completing the proof.

XI. IMPLEMENTATION DETAILS OF THE EXPERIMENTS

A. SNR Estimation

We train a simple fully connected model with one hidden layer. First, the data is normalized by its second moment, and then the input is augmented with hand-crafted features: the fourth and sixth moments and different functions of them. We train the network using N = 50 synthetic data points in which the mean is sampled uniformly in [1, 10] and the SNR is then sampled uniformly in [2, 50] (which corresponds to [3dB, 16dB]). The data is generated independently in each batch. We trained the model using the standard MSE loss, and using BCE with λ = 1000. We use the ADAM solver with a multistep learning rate scheduler. We use batch sizes of N = 10 and M = 100 as defined in (7).

B. Structured Covariance Estimation

We train a neural network for estimating the covariance matrix with the following architecture. First, the sample covariance C_0 is calculated from the input x, as it is the sufficient statistic for the covariance of a Gaussian distribution. A vector α_{k=0} is initialized to ½·1_9 (a length-9 vector of 0.5 entries) and a vector v_{k=0} is initialized to zero. Next, a one hidden layer fully connected network, whose input is the concatenation of C_k and v_k, is used to predict modifications ∆α and ∆v for the vectors α_k and v_k respectively, such that α_{k+1} = α_k + 0.1∆α and similarly for v_{k+1}. Then an updated covariance C_{k+1} is calculated using α_{k+1} and equation (26). The process is repeated (with the updated C_k and v_k as inputs to the fully connected network) for 50 iterations, and the final covariance is the output of the network. The network is trained on synthetic data in which the parameters of the covariance are generated uniformly in their valid region, and then M different x's are generated from a normal distribution with zero mean and the generated covariance. We use an ADAM solver with a "ReduceLROnPlateau" scheduler monitoring the desired loss (the BCE loss for BCE and the MSE loss for EMMSE) on a synthetic validation set. We trained the model using the standard MSE loss, and using BCE with λ = 1000. We use batch sizes of N = 1 and M = 20 as defined in (7).


C. BCE as Regularization

The underlying model in the experiment is:

x = Hy + w
w ∼ N(0, Σ_w)
y ∼ N(0, Σ_y)                                             (40)

Dimensions are x ∈ R^n with n = 20 and y ∈ R^d with d = 20. The covariance matrix of the noise w is non-diagonal and satisfies (1/n) Tr(Σ_w) = 1. The covariance matrix of the prior of y is non-diagonal; five of its eigenvalues are equal to 0.01 and the others are equal to 100, so Σ_y is approximately rank deficient. In order to take the expectation over the training set, the experiment is repeated 500 times with independently sampled training sets (that is, 500 different draws of the N prior samples). We assume a limited number of training samples {y_i}_{i=1}^{N}, but full knowledge of the measurement model. Thus we use (14) for BCE (with λ = 0 for EMMSE as a special case) and:

A_Ridge(λ) = ( Hᵀ (Σ_x + λI)⁻¹ H + Σ_y⁻¹ )⁻¹ Hᵀ (Σ_x + λI)⁻¹,          (41)

for Ridge. The best λ was found using a grid search over 100 values of λ in [0, 10] for BCE and in [−0.012, 0.002] for Ridge (the optimal parameters were found experimentally to lie in these regions in the above setup).

D. Image Classification with Soft Labels

We generate soft labels using a "teacher" network. Specifically, we work on the CIFAR10 dataset and use a DLA architecture [Yu et al., 2018], which achieves 0.95 accuracy, as the teacher. Our student network is a very small convolutional neural network (CNN) with two convolutional layers with 16 and 32 channels respectively and a single fully connected layer. We use the following notation: the original dataset is a set of N triplets {x_i, y_i^h, y_i^s} of images x_i, one-hot hard-label vectors y_i^h and soft-label vectors y_i^s. In the augmented data, M different images x_ij are generated randomly from each original image x_i using random cropping and flipping. The output of the network for class l is denoted by z_ij^l, and the vector of "probabilities" q_ij(T) is defined by:

q_ij^l(T) = exp(z_ij^l / T) / Σ_l exp(z_ij^l / T)          (42)

where T is a "temperature" that controls the softness of the probabilities. We define the following loss functions:

L_hard = Σ_{ij}^{MN} CE( q_ij(T=1), y_i^h )

L_MSE = Σ_{ij}^{MN} ‖ q_ij(T=20) − y_i^s ‖²

L_bias = Σ_{i}^{N} ‖ Σ_{j}^{M} q_ij(T=20) − y_i^s ‖²

where CE is the cross-entropy loss. The regular network and the BCE network are trained with the following losses:

L_EMMSE = L_hard + L_MSE
L_BCE = L_hard + L_bias


using stochastic gradient descent (SGD). In training, we use 20 different random crops and flips. We test the trained network at five different checkpoints during training and report the average and the standard deviation. At test time we use the same data augmentation as in training, and the scores of the different crops are averaged to obtain the final score, as in Algorithm 2. Fig. 4 shows the average and the standard deviation of the accuracy as a function of the number of crops at test time.
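A PyTorch-style sketch of the temperature softmax (42) and the loss terms defined above, assuming logits z of shape (N, M, L) for the M crops of each image and labels yh, ys of shape (N, L); the sums are implemented as means (i.e., up to normalization constants) and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def soft_probs(z, T):
    """Temperature-scaled softmax of (42) over the class dimension."""
    return F.softmax(z / T, dim=-1)

def distillation_losses(z, yh, ys, T=20.0):
    """Return (L_EMMSE, L_BCE) for logits z of shape (N, M, L)."""
    N, M, L = z.shape
    hard = yh.argmax(dim=-1).repeat_interleave(M)             # hard class index per crop
    l_hard = F.cross_entropy(z.reshape(N * M, L), hard)       # CE term with q_ij(T=1)
    q = soft_probs(z, T)                                      # soft predictions, (N, M, L)
    l_mse = ((q - ys.unsqueeze(1)) ** 2).sum(dim=-1).mean()   # per-crop soft-label error
    l_bias = ((q.mean(dim=1) - ys) ** 2).sum(dim=-1).mean()   # error of the crop-averaged prediction
    return l_hard + l_mse, l_hard + l_bias                    # L_EMMSE and L_BCE
```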