Transcript of: Scoring Rules, Generalized Entropy, and Utility Maximization

Page 1

Scoring Rules, Generalized Entropy, and Utility Maximization

Robert Nau

Fuqua School of Business

Duke University

(with Victor Jose and Robert Winkler)

Presentation for IEOR Seminar

Berkeley, October 29, 2006

Page 2

Overview

• Scoring rules are reward functions for defining subjective probabilities and eliciting them in forecasting applications and experimental economics (de Finetti, Brier, Savage, Selten...)

• Cross-entropy, or divergence, is a physical measure of information gain in communication theory and machine learning (Shannon, Kullback-Leibler...)

• Utility maximization is the decision maker’s objective in Bayesian decision theory and game theory (von Neumann & Morgenstern, Savage...)

Page 3

General connections

• Any decision problem under uncertainty may be used to define a scoring rule or measure of divergence between probability distributions.

• The expected score or divergence is merely the expected-utility gain that results from solving the problem using the decision maker’s “true” probability distribution p rather than some other “baseline” distribution q.

Page 4

Specific results

• We explore the connections among the best-known parametric families of generalized scoring rules, divergence measures, and utility functions.

• The expected scores obtained by truthful probability assessors turn out to correspond exactly to well-known generalized divergences.

• They also correspond exactly to expected-utility gains in financial investment problems with utility functions drawn from the linear-risk-tolerance (a.k.a. HARA) family.

• These results generalize to incomplete markets via a primal-dual pair of convex programs.

Page 5

Part 1: Scoring rules

• Consider a probability forecast for a discrete event with n possible outcomes (“states of the world”).

• Let ei = (0, ..., 1, ..., 0) denote the indicator vector for the ith state (where 1 appears in the ith position).

• Let p = (p1, ..., pn) denote the forecaster’s true subjective probability distribution over states.

• Let r = (r1, ..., rn) denote the forecaster’s reported distribution (if different from p).

• Let q = (q1, ..., qn) denote a baseline (“prior”) distribution upon which the forecaster seeks to improve.

Page 6

Definition of a scoring rule

• A scoring rule is a function S(r, ei, q) that determines the forecaster’s score (reward) for giving the forecast r, relative to the baseline q, when the ith state is subsequently observed to occur.

• Let S(r, p, q) = Σ_i p_i S(r, e_i, q) denote the forecaster’s expected score for reporting r when her true distribution is p and the baseline distribution is q.

• Thus, in general, a scoring rule can be expressed as a function of three vector-valued arguments, which is linear in the 2nd argument.

Page 7

Proper scoring rules

• The scoring rule S is [strictly] proper if S(p, p, q) ≥ [>] S(r, p, q) for all r [≠ p], i.e., if the forecaster’s expected score is [uniquely] maximized when she reports her true probabilities.

• Henceforth let S(p, q) := S(p, p, q) denote the forecaster’s expected score for a truthful forecast, as a function of p and q.

• S is [strictly] proper iff S(p, q) is [strictly] convex in p.

Page 8

Proper scoring rules, continued

• If S is strictly proper, then it is uniquely determined from its expected-score function S(p, q) by McCarthy’s (1956) formula:

$$S(p, e_i, q) \;=\; S(p, q) \;+\; \nabla_p S(p, q)\cdot(e_i - p)$$

• Thus, a strictly proper scoring rule is completely characterized by its expected-score function.

• Henceforth only strictly proper scoring rules will be considered, and it will be assumed that r = p.
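For example, applying McCarthy’s formula to the expected-score function S(p, q) = Σ_i p_i ln(p_i/q_i) (the weighted logarithmic case that appears later in the talk) recovers the weighted logarithmic score, since the ith component of the gradient is ln(p_i/q_i) + 1:

$$S(p, e_i, q) \;=\; \sum_j p_j \ln\frac{p_j}{q_j} \;+\; \Big(\ln\frac{p_i}{q_i} + 1\Big) \;-\; \sum_j p_j\Big(\ln\frac{p_j}{q_j} + 1\Big) \;=\; \ln\frac{p_i}{q_i}.$$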

Page 9

Standard scoring rules

• The three most commonly used scoring rules all assume a uniform baseline distribution (q_i = 1/n for all i), which will be temporarily suppressed.

• Quadratic scoring rule:
$$S(r, e_i) = 2 r_i - \sum_{j=1}^{n} r_j^2$$

• Spherical scoring rule:
$$S(r, e_i) = \frac{r_i}{\big(\sum_{j=1}^{n} r_j^2\big)^{1/2}}$$

• Logarithmic scoring rule:
$$S(r, e_i) = \ln r_i$$
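A minimal Python sketch of these three rules, with a crude grid search confirming that truthful reporting maximizes the expected quadratic score (the distribution p below is the three-state example used in the figures later in the talk):

```python
import numpy as np

def quadratic(r, i):
    return 2 * r[i] - np.sum(r ** 2)

def spherical(r, i):
    return r[i] / np.sqrt(np.sum(r ** 2))

def logarithmic(r, i):
    return np.log(r[i])

def expected_score(score, r, p):
    # expected score for reporting r when the forecaster's true distribution is p
    return sum(p[i] * score(r, i) for i in range(len(p)))

p = np.array([0.05, 0.25, 0.70])
grid = [np.array([a, b, 1.0 - a - b])
        for a in np.arange(0.01, 0.98, 0.01)
        for b in np.arange(0.01, 0.98, 0.01) if a + b <= 0.98]
best = max(grid, key=lambda r: expected_score(quadratic, r, p))
print(best)   # prints [0.05 0.25 0.7]: the expected score peaks at r = p
```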

Page 10

History of standard scoring rules

• The quadratic scoring rule was introduced by de Finetti (1937, 1974) to define subjective probability; later used by Brier (1950) as a tool for evaluating and paying weather forecasters; more recently advocated by Selten (1998) for paying subjects in economic experiments.

• The spherical and logarithmic rules were introduced by I.J. Good (1971), who also noted that the spherical and quadratic rules could be generalized to positive exponents other than 2, leading to...

Page 11

Generalized scoring rules (uniform q)

• Power scoring rule (→ quadratic at β = 2):
$$S_\beta(r, e_i) = \beta\, r_i^{\beta-1} - (\beta-1)\sum_{j=1}^{n} r_j^{\beta}$$

• Pseudospherical scoring rule (→ spherical at β = 2):
$$S_\beta(r, e_i) = \frac{r_i^{\beta-1}}{\big(\sum_{j=1}^{n} r_j^{\beta}\big)^{(\beta-1)/\beta}}$$

• Both rules → rescaled logarithmic rule at β = 1.

Page 12

Weighted scoring rules (arbitrary q)

• Our first contribution is merely to point out that the power and pseudospherical rules can be weighted by an arbitrary baseline distribution q and scaled so as to be valid for all real β.

• Under the weighted rules, the score is zero in all states iff p = q, and the expected score is positive iff p ≠ q.

• Thus, the weighted rules measure the “value added” of p over q as seen from the forecaster’s perspective.

Page 13

Weighted power scoring rule:
$$S_\beta(r, e_i, q) \;=\; \frac{\beta\,(r_i/q_i)^{\beta-1} - (\beta-1)\sum_{j=1}^{n} q_j (r_j/q_j)^{\beta} - 1}{\beta(\beta-1)}$$

Weighted pseudospherical scoring rule:
$$S_\beta(r, e_i, q) \;=\; \frac{1}{\beta-1}\left(\frac{(r_i/q_i)^{\beta-1}}{\big(\sum_{j=1}^{n} q_j (r_j/q_j)^{\beta}\big)^{(\beta-1)/\beta}} - 1\right)$$
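A minimal Python sketch, assuming the weighted score formulas as written above, that checks two of the properties just claimed: the score vanishes in every state when the reported distribution equals the baseline, and truthful reporting is optimal in expectation:

```python
import numpy as np

def weighted_power_score(r, i, q, beta):
    # weighted power score for report r, state i, baseline q (formula as written above)
    return (beta * (r[i] / q[i]) ** (beta - 1)
            - (beta - 1) * np.sum(q * (r / q) ** beta) - 1) / (beta * (beta - 1))

def weighted_pseudospherical_score(r, i, q, beta):
    norm = np.sum(q * (r / q) ** beta) ** ((beta - 1) / beta)
    return ((r[i] / q[i]) ** (beta - 1) / norm - 1) / (beta - 1)

def expected_score(score, r, p, q, beta):
    return sum(p[i] * score(r, i, q, beta) for i in range(len(p)))

beta = 0.5
p = np.array([0.05, 0.25, 0.70])
q = np.array([1 / 3, 1 / 3, 1 / 3])

# the score is zero in every state when the reported distribution equals the baseline
print([round(weighted_power_score(q, i, q, beta), 12) for i in range(3)])            # [0.0, 0.0, 0.0]
print([round(weighted_pseudospherical_score(q, i, q, beta), 12) for i in range(3)])  # [0.0, 0.0, 0.0]

# truthful reporting (r = p) beats a sample of dishonest reports in expectation
rng = np.random.default_rng(0)
honest = expected_score(weighted_power_score, p, p, q, beta)
dishonest = [expected_score(weighted_power_score, rng.dirichlet(np.ones(3)), p, q, beta)
             for _ in range(1000)]
print(honest >= max(dishonest))   # True, reflecting strict propriety
```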

Page 14

Properties of weighted scoring rules

• Both rules are strictly proper for all real β.

• Both rules → the weighted logarithmic rule ln(p_i/q_i) at β = 1.

• For the same p, q, and β, the vector of weighted power scores is an affine transformation of the vector of weighted pseudospherical scores, since both are affine functions of (p_i/q_i)^(β−1).

• However, the two rules present different incentives for information-gathering and honest reporting.

• The special cases β = 0 and β = ½ have interesting properties but have not been previously studied.
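To spell out the affine relationship (a sketch using the score formulas as written above, with the shorthand m(p,q) = Σ_j q_j (p_j/q_j)^β and x_i = (p_i/q_i)^(β−1)):

$$S^{\mathrm{pow}}_\beta(p, e_i, q) \;=\; \frac{x_i}{\beta-1} \;-\; \frac{(\beta-1)\,m(p,q) + 1}{\beta(\beta-1)},
\qquad
S^{\mathrm{sph}}_\beta(p, e_i, q) \;=\; \frac{x_i}{(\beta-1)\,m(p,q)^{(\beta-1)/\beta}} \;-\; \frac{1}{\beta-1},$$

so, for fixed p, q, and β, each score vector is an increasing affine function of the other.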

Page 15

Special cases of weighted scores

Page 16

Weighted expected score functions

• Weighted power expected score:
$$S_\beta(p, q) \;=\; \frac{\sum_{i=1}^{n} q_i (p_i/q_i)^{\beta} - 1}{\beta(\beta-1)}$$

• Weighted pseudospherical expected score:
$$S_\beta(p, q) \;=\; \frac{\big(\sum_{i=1}^{n} q_i (p_i/q_i)^{\beta}\big)^{1/\beta} - 1}{\beta-1}$$
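A short numerical check, assuming the formulas above, that these closed forms agree with the direct expectation Σ_i p_i S_β(p, e_i, q), and that the power expected score dominates the pseudospherical one (a point made again with Figure 3 below):

```python
import numpy as np

def power_expected(p, q, beta):
    return (np.sum(q * (p / q) ** beta) - 1) / (beta * (beta - 1))

def pseudospherical_expected(p, q, beta):
    return (np.sum(q * (p / q) ** beta) ** (1 / beta) - 1) / (beta - 1)

def power_score(p, i, q, beta):
    # weighted power score in state i for a truthful report
    return (beta * (p[i] / q[i]) ** (beta - 1)
            - (beta - 1) * np.sum(q * (p / q) ** beta) - 1) / (beta * (beta - 1))

p = np.array([0.05, 0.25, 0.70])
q = np.array([0.20, 0.30, 0.50])

for beta in [-1.0, 0.5, 2.0, 3.0]:
    direct = sum(p[i] * power_score(p, i, q, beta) for i in range(3))
    assert np.isclose(direct, power_expected(p, q, beta))                   # closed form matches
    assert power_expected(p, q, beta) >= pseudospherical_expected(p, q, beta) - 1e-12
print("power expected score >= pseudospherical expected score on all test cases")
```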

Page 17

Special cases of expected scores (table comparing the power and pseudospherical expected scores at selected values of β)

Page 18

Figure 1. Weighted power score vs. beta (uniform q); curves shown for State 1 (p = 0.05), State 2 (p = 0.25), and State 3 (p = 0.70).

• Behavior of the weighted power score for n = 3.

• For fixed p and q, the scores diverge as β → ±∞.

• For β << 0 [β >> 2], only the lowest- [highest-] probability event is distinguished from the others.

Page 19

• By comparison, the weighted pseudospherical scores approach fixed limits as β → ±∞.

• Again, for β << 0 [β >> 2], only the lowest- [highest-] probability event is distinguished from the others.

Figure 2. Weighted pseudospherical score vs. beta (uniform q); curves shown for State 1 (p = 0.05), State 2 (p = 0.25), and State 3 (p = 0.70).

Page 20

The corresponding expected scores vs. β are equal at β = 1, where both rules converge to the weighted logarithmic scoring rule, but elsewhere the weighted power expected score is strictly larger.

Figure 3. Expected scores vs. beta (p = 0.05, 0.25, 0.70; uniform q); curves shown for the power and pseudospherical expected scores.

Page 21

Part 2. Entropy

• In statistical physics, the entropy of a system with n possible internal states having probability distribution p is defined (up to a multiplicative constant) by
$$H(p) = -\sum_{i=1}^{n} p_i \ln p_i$$

• In communication theory, the entropy H(p) is the expected “self-information” of an event drawn from a stationary random process with distribution p, i.e., the average number of bits required to optimally encode it (Shannon 1948).

Page 22

The KL divergence

• The cross-entropy, or Kullback–Leibler divergence, between two distributions p and q measures the expected information gain (reduction in the average number of bits per event) due to replacing the “wrong” distribution q with the “right” distribution p:
$$D_{KL}(p \,\|\, q) \;=\; \sum_{i=1}^{n} p_i \ln\frac{p_i}{q_i}$$

Page 23

Properties of the KL divergence

• Additivity with respect to independent partitions of the state space:
$$D_{KL}(p_A \otimes p_B \,\|\, q_A \otimes q_B) \;=\; D_{KL}(p_A \,\|\, q_A) \;+\; D_{KL}(p_B \,\|\, q_B)$$

• Thus, if A and B are independent events whose initial distributions qA and qB are respectively updated to pA and pB, the total expected information gain in their product space is the sum of the separate expected information gains, as measured by their KL divergences.

Page 24

Properties of the KL divergence

• Recursivity with respect to the splitting of events: if state 1 is split into two sub-events, so that p = (p_1a, p_1b, p_2, ..., p_n) with p_1 = p_1a + p_1b (and similarly for q), then
$$D_{KL}(p \,\|\, q) \;=\; D_{KL}\big((p_1, \ldots, p_n) \,\|\, (q_1, \ldots, q_n)\big) \;+\; p_1\, D_{KL}\Big(\big(\tfrac{p_{1a}}{p_1}, \tfrac{p_{1b}}{p_1}\big) \,\Big\|\, \big(\tfrac{q_{1a}}{q_1}, \tfrac{q_{1b}}{q_1}\big)\Big)$$

• Thus, the total expected information gain does not depend on whether the true state is resolved all at once or via a sequential splitting of events.
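A quick numerical sanity check of both properties (the distributions below are arbitrary illustrative values):

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Additivity: KL over a product of independent partitions is the sum of the parts
pA, qA = [0.3, 0.7], [0.5, 0.5]
pB, qB = [0.1, 0.6, 0.3], [0.2, 0.4, 0.4]
prod_p = np.outer(pA, pB).ravel()
prod_q = np.outer(qA, qB).ravel()
assert np.isclose(kl(prod_p, prod_q), kl(pA, qA) + kl(pB, qB))

# Recursivity: splitting state 1 into two sub-events does not change the total divergence
p_fine = [0.10, 0.15, 0.30, 0.45]            # state 1 (prob 0.25) split into 0.10 + 0.15
q_fine = [0.20, 0.20, 0.30, 0.30]            # baseline state 1 (prob 0.40) split into 0.20 + 0.20
p_coarse, q_coarse = [0.25, 0.30, 0.45], [0.40, 0.30, 0.30]
conditional = 0.25 * kl([0.10 / 0.25, 0.15 / 0.25], [0.20 / 0.40, 0.20 / 0.40])
assert np.isclose(kl(p_fine, q_fine), kl(p_coarse, q_coarse) + conditional)
print("additivity and recursivity hold on these examples")
```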

Page 25

Other divergence/distance measures

• The Chi-square divergence (Pearson 1900) is used by frequentist statisticians to measure goodness of fit:
$$\chi^2(p \,\|\, q) \;=\; \sum_{i=1}^{n} \frac{(p_i - q_i)^2}{q_i}$$

• The Hellinger distance is a symmetric measure of distance between two distributions that is popular in machine learning applications:
$$H(p, q) \;=\; \Big(\sum_{i=1}^{n} \big(\sqrt{p_i} - \sqrt{q_i}\big)^2\Big)^{1/2}$$

Page 26

Onward to generalized divergence...

• The properties of additivity and recursivity can be regarded as axioms for a measure of expected information gain, and together they imply the KL divergence.

• However, weaker axioms of “pseudoadditivity” and “pseudorecursivity” lead to parametric families of generalized divergence.

• These generalized divergences “interpolate” between and “extrapolate” beyond the KL divergence, the Chi-square divergence, and the Hellinger distance.

Page 27

Power divergence

• The directed divergence of order β, a.k.a. the power divergence, was proposed by Havrda & Charvát (1967) and further elaborated by Rathie & Kannappan (1972), Cressie & Read (1980), and Haussler & Opper (1997), among others:
$$D_\beta(p \,\|\, q) \;=\; \frac{\sum_{i=1}^{n} q_i (p_i/q_i)^{\beta} - 1}{\beta(\beta-1)}$$

• It is pseudoadditive and pseudorecursive for all β, and it coincides with the KL divergence at β = 1.

• It is identical to the weighted power expected score; hence the power divergence is the implicit information measure behind the weighted power scoring rule.

Page 28

Pseudospherical divergence

• An alternative generalized entropy was introduced by Arimoto (1971) and further studied by Sharma & Mittal (1975), Boekee & Van der Lubbe (1980), and Lavenda & Dunning-Davies (2003), for β > 1:
$$H_\beta(p) \;=\; \frac{\beta}{\beta-1}\Big(1 - \big(\sum_{i=1}^{n} p_i^{\beta}\big)^{1/\beta}\Big)$$

• The corresponding divergence, which we call the pseudospherical divergence, is obtained by introducing a baseline distribution q and dividing out the unnecessary β in the numerator:
$$D_\beta(p \,\|\, q) \;=\; \frac{\big(\sum_{i=1}^{n} q_i (p_i/q_i)^{\beta}\big)^{1/\beta} - 1}{\beta-1}$$

Page 29

Properties of the pseudospherical divergence

• It is defined for all real β (not merely β > 1).

• It is pseudoadditive but generally not pseudorecursive.

• It is identical to the weighted pseudospherical expected score, hence the pseudospherical divergence is the implicit information measure behind the weighted pseudospherical scoring rule.

Page 30

Interesting special cases

• The power and pseudospherical divergences both coincide with the KL divergence at β = 1.

• At β = 0, β = ½, and β = 2 they are linearly (or at least monotonically) related to the reverse KL divergence, the squared Hellinger distance, and the Chi-square divergence, respectively:
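In detail, writing D^pow_β and D^sph_β for the two divergences as given above, and χ² and H for the Chi-square divergence and Hellinger distance from the earlier slide, the limits work out (a sketch computed from those formulas) to:

$$\begin{aligned}
\beta = 0:\quad & D^{\mathrm{pow}}_0(p\,\|\,q) = D_{KL}(q\,\|\,p), & D^{\mathrm{sph}}_0(p\,\|\,q) &= 1 - e^{-D_{KL}(q\,\|\,p)},\\
\beta = \tfrac12:\quad & D^{\mathrm{pow}}_{1/2}(p\,\|\,q) = 2\,H(p,q)^2, & D^{\mathrm{sph}}_{1/2}(p\,\|\,q) &= 2\Big(1 - \big(\textstyle\sum_i \sqrt{p_i q_i}\big)^2\Big),\\
\beta = 2:\quad & D^{\mathrm{pow}}_2(p\,\|\,q) = \tfrac12\,\chi^2(p\,\|\,q), & D^{\mathrm{sph}}_2(p\,\|\,q) &= \sqrt{1 + \chi^2(p\,\|\,q)} - 1.
\end{aligned}$$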

Page 31

Where we’ve gotten so far...

• There are two parametric families of weighted, strictly proper scoring rules which correspond exactly to two well-known families of generalized divergence, each of which has a full “spectrum” of possibilities (−∞ < β < ∞).

• But what is the decision-theoretic significance of these quantities?

• What are some guidelines for choosing among the two families and their parameters?

Page 32

Part 3. Financial decisions under uncertainty with linear risk tolerance

• Suppose that an investor with subjective probability distribution p and utility function u bets or trades optimally against a risk-neutral opponent or contingent claim market with distribution q.

• For any risk-averse utility function, the investor’s gain in expected utility yields an economic measure of the divergence between p and q.

• In particular, suppose the investor’s utility function belongs to the linear risk tolerance (HARA) family, i.e., the family of generalized exponential, logarithmic, and power utility functions.

Page 33

Risk aversion and risk tolerance

• Let y denote gain or loss relative to a (riskless) status-quo wealth position, and let u(y) denote the utility of y.

• The monetary quantity τ(y) ≡ −u′(y)/u″(y) is the investor’s local risk tolerance at y (the reciprocal of the Pratt–Arrow measure of local risk aversion).

• The usual decision-analytic rule of thumb is as follows: an investor with current wealth y and local risk tolerance τ(y) is roughly indifferent to accepting a 50-50 gamble between the wealth positions y + τ(y) and y − ½τ(y), i.e., indifferent to gaining τ(y) or losing ½τ(y) with equal probability.
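For example, with constant risk tolerance (exponential utility u(y) = 1 − e^(−y/τ)) and τ normalized to 1, the 50-50 gamble between gaining τ and losing ½τ has expected utility

$$\tfrac12\,u(1) + \tfrac12\,u(-\tfrac12) \;=\; \tfrac12\big(1 - e^{-1}\big) + \tfrac12\big(1 - e^{1/2}\big) \;\approx\; 0.316 - 0.324 \;=\; -0.008 \;\approx\; 0 \;=\; u(0),$$

so the investor is indeed approximately indifferent between the gamble and the status quo.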

Page 34

Linear risk tolerance (LRT) utility

• The most commonly used utility functions in decision analysis and financial economics have the property of linear risk tolerance, i.e., τ(y) = τ + βy, where τ > 0 and β is the risk-tolerance coefficient.

• If the unit of money is normalized so that the risk tolerance equals 1 at y = 0 (status-quo wealth), then τ = 1, and the utility function is u(y) = g_β(y), where:
$$g_\beta(y) \;=\; \frac{(1+\beta y)^{(\beta-1)/\beta} - 1}{\beta-1}$$
(for 1 + βy > 0, with limiting cases g_0(y) = 1 − e^(−y) and g_1(y) = ln(1+y)).
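A minimal Python sketch of g_β as written above, confirming the special cases named in Figure 4 (quadratic, exponential, logarithmic, square-root) and the mutual tangency at y = 0:

```python
import numpy as np

def g(y, beta):
    # normalized LRT utility: risk tolerance 1 + beta*y, with g(0) = 0 and g'(0) = 1
    if np.isclose(beta, 0.0):
        return 1.0 - np.exp(-y)                       # exponential (constant risk tolerance)
    if np.isclose(beta, 1.0):
        return np.log1p(y)                            # logarithmic
    return ((1.0 + beta * y) ** ((beta - 1.0) / beta) - 1.0) / (beta - 1.0)

y = 0.3
print(np.isclose(g(y, -1.0), y - y ** 2 / 2))                 # beta = -1: quadratic utility
print(np.isclose(g(y, 2.0), np.sqrt(1.0 + 2.0 * y) - 1.0))    # beta = 2: square-root utility

# mutual tangency at y = 0: g_beta(0) = 0 and g_beta'(0) = 1 for every beta
eps = 1e-6
print([round((g(eps, b) - g(0.0, b)) / eps, 4) for b in (-1.0, 0.0, 0.5, 1.0, 2.0)])  # all ~1.0
```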

Page 35

Special cases of normalized LRT utility

Page 36

Figure 4. Normalized LRT utility functions (beta = risk tolerance coefficient); curves shown for beta = −1 (quadratic), beta = 0 (exponential), beta = 1 (logarithmic), and beta = 2 (square-root).

Qualitative properties of LRT utility

g_β(0) = 0 and g_β′(0) = 1 for all β: the functions {g_β(y)} are mutually tangent, with dollar-utile parity, at y = 0.

Page 37

The investor’s decision model

• Model Y: the investor seeks the payoff vector y that maximizes her own LRT expected utility under her distribution p subject to not decreasing the opponent’s linear expected utility (i.e., expected value) under his distribution q.

• The investor’s reward in state i is her own ex post utility payoff g_β(y_i).

Page 38

A modified decision model

• Model Y′: the investor seeks the payoff vector y that maximizes the sum of her own LRT expected utility under her distribution p and the opponent’s linear expected utility (expected value) under his distribution q.

• The investor’s reward in state i is her own ex post utility payoff g_β(y_i) plus the opponent’s ex ante expected monetary payoff.

Page 39

Main result

1. In the solution of Model Y, the investor’s utility payoff in state i is the weighted pseudospherical score, whose expected value is the pseudospherical divergence.

2. In the solution of Model Y′, the investor’s reward in state i is the weighted power score, whose expected value is the power divergence.

3. For any p, q, and β, the weighted power expected score (power divergence) is greater than or equal to the weighted pseudospherical expected score (pseudospherical divergence).
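A numerical sketch of results 1 and 2, assuming the reconstructed g_β and score formulas above; the closed-form optimal payoffs below come from the first-order conditions of each model, and β, p, q are arbitrary example values:

```python
import numpy as np

beta = 0.5
p = np.array([0.05, 0.25, 0.70])
q = np.array([0.20, 0.30, 0.50])

def g(y, b=beta):
    # normalized LRT utility with risk tolerance 1 + b*y
    return ((1.0 + b * y) ** ((b - 1.0) / b) - 1.0) / (b - 1.0)

def pseudospherical_score(i):
    norm = np.sum(q * (p / q) ** beta) ** ((beta - 1) / beta)
    return ((p[i] / q[i]) ** (beta - 1) / norm - 1) / (beta - 1)

def power_score(i):
    return (beta * (p[i] / q[i]) ** (beta - 1)
            - (beta - 1) * np.sum(q * (p / q) ** beta) - 1) / (beta * (beta - 1))

# Model Y: maximize E_p[g(y)] subject to E_q[y] <= 0; the first-order conditions give
lam = np.sum(q * (p / q) ** beta) ** (1 / beta)
y_Y = ((p / (lam * q)) ** beta - 1) / beta
print(np.allclose(g(y_Y), [pseudospherical_score(i) for i in range(3)]))   # True (result 1)

# Model Y': maximize E_p[g(y)] + E_q[-y]; the pointwise first-order conditions give
y_Yp = ((p / q) ** beta - 1) / beta
reward = g(y_Yp) + np.sum(q * (-y_Yp))      # utility payoff plus opponent's expected payoff
print(np.allclose(reward, [power_score(i) for i in range(3)]))             # True (result 2)
```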

Page 40

Observations

• Insofar as Model Y is a more “realistic” investment problem than Model Y′, the pseudospherical divergence appears to be more economically meaningful than the power divergence.

• The same results are obtained if the investor is endowed with linear utility while the opponent is risk averse with risk tolerance coefficient 1.

• Both of these problems involve non-decreasing risk tolerance on the part of the more-risk-averse agent only if β is between 0 and 1.

Page 41

Extension to incomplete markets

• Suppose the investor faces an incomplete market in which asset prices are supported by a convex set of risk-neutral distributions.

• Let Q denote the matrix whose rows are the extreme points of the set of risk-neutral distributions.

• Then the investor seeks the payoff vector y that maximizes her own LRT expected utility under her distribution p subject to the constraint Qy ≤ 0.

• This is a convex optimization problem whose dual is to find the risk neutral distribution in the convex hull of the rows of Q that minimizes the pseudospherical divergence from p.

Page 42

Details of duality relationship

• Let z denote a vector of non-negative weights, summing to 1, for the k rows of Q.

• Then zᵀQ is a supporting risk-neutral distribution in the convex hull of the rows of Q, and the primal-dual pair of optimization problems is as follows (with D_β denoting the pseudospherical divergence):
$$\text{(Primal)}\quad \max_{y}\; \sum_{i} p_i\, g_\beta(y_i) \;\;\text{s.t.}\;\; Qy \le 0
\qquad\qquad
\text{(Dual)}\quad \min_{z \ge 0,\; \mathbf{1}^{\top}z = 1}\; D_\beta\big(p \,\|\, z^{\top}Q\big)$$
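A numerical sketch of this duality, assuming the reconstructed g_β and pseudospherical divergence above; β, p, and the two extreme risk-neutral distributions in Q are arbitrary example values:

```python
import numpy as np
from scipy.optimize import minimize

beta = 0.5
p = np.array([0.05, 0.25, 0.70])
Q = np.array([[0.20, 0.30, 0.50],            # two extreme risk-neutral distributions
              [0.40, 0.40, 0.20]])           # spanning a hypothetical incomplete market

def g(y):
    # normalized LRT utility with risk tolerance 1 + beta*y
    return ((1.0 + beta * y) ** ((beta - 1.0) / beta) - 1.0) / (beta - 1.0)

def pseudospherical_div(p, q):
    return (np.sum(q * (p / q) ** beta) ** (1.0 / beta) - 1.0) / (beta - 1.0)

# Primal: maximize expected LRT utility subject to Qy <= 0
res = minimize(lambda y: -np.dot(p, g(y)), x0=np.zeros(3), method="SLSQP",
               bounds=[(-1.9, None)] * 3,                    # keeps 1 + beta*y > 0
               constraints=[{"type": "ineq", "fun": lambda y: -(Q @ y)}])
primal_value = -res.fun

# Dual: minimize the pseudospherical divergence over the convex hull of the rows of Q
zs = np.linspace(0.0, 1.0, 20001)
dual_value = min(pseudospherical_div(p, z * Q[0] + (1.0 - z) * Q[1]) for z in zs)

print(round(primal_value, 6), round(dual_value, 6))   # by the stated duality, these should (nearly) coincide
```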

Page 43

Conclusions

• The commonly used power & pseudospherical scoring rules can be improved by incorporating a not-necessarily-uniform baseline distribution.

• The resulting weighted expected scores are equal to well-known generalized divergences.

• The weighted pseudospherical scoring rule and its divergence have a more natural utility-theoretic interpretation than the weighted power versions.

• Values of β between 0 and 1 appear to be the most interesting, and the cases β = 0 and β = ½ have so far been under-explored.