Lossless compression and cumulative log loss
Yoav Freund
January 13, 2014
Outline
- Lossless data compression
  - The guessing game
  - Arithmetic coding
  - The performance of arithmetic coding
  - Source entropy
- Other properties of log loss
  - Unbiased prediction
  - Other examples for using log loss
- Universal coding
  - Two part codes
  - Combining expert advice for cumulative log loss
- Combining experts in the log loss framework
  - The online Bayes Algorithm
  - The performance bound
Lossless data compression
The source compression problem
- Example: "There are no people like show people"
  encode → x ∈ {0,1}^n → decode → "there are no people like show people"
- Lossless: the message is reconstructed perfectly.
- Goal: minimize the expected length E(n) of the coded message.
- Can we do better than ⌈log_2(26)⌉ = 5 bits per character?
- Basic idea: use short codes for common messages.
- Stream compression:
  - The message is revealed one character at a time.
  - The code is generated as the message is revealed.
  - The decoded message is constructed gradually.
  - Easier than block codes when processing long messages.
  - A natural way of describing a distribution.
The guessing game

The guessing game
- The message is revealed one character at a time.
- An algorithm predicts the next character from the revealed part of the message.
- If the algorithm is wrong, ask for its next guess.
- Example ("there are no pe…", with ␣ marking a space):

  character:  t  h  e  r  e  ␣  a  r  e  ␣  n  o  ␣  p  e
  guesses:    6  2  1  2  1  1  5  2  1  1  4  1  1  5  3

- Code = the sequence of numbers of guesses.
- To decode, run the same prediction algorithm.
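The guessing game can be sketched in code. The predictor below is a hypothetical one, not the one used on the slide: it ranks characters by their frequency in the revealed prefix, breaking ties alphabetically. Any deterministic ranking works, as long as the encoder and the decoder share it.

```python
def ranking(prefix, alphabet):
    """Rank characters: most frequent in the revealed prefix first,
    ties broken alphabetically (an illustrative choice)."""
    counts = {ch: prefix.count(ch) for ch in alphabet}
    return sorted(alphabet, key=lambda ch: (-counts[ch], ch))

def encode(msg, alphabet):
    """Code = for each character, the number of guesses needed."""
    code = []
    for t, ch in enumerate(msg):
        order = ranking(msg[:t], alphabet)
        code.append(order.index(ch) + 1)   # 1 = first guess correct
    return code

def decode(code, alphabet):
    """Run the same predictor to invert the code."""
    msg = ""
    for guesses in code:
        order = ranking(msg, alphabet)
        msg += order[guesses - 1]
    return msg

alphabet = "abcdefghijklmnopqrstuvwxyz "
msg = "there are no people like show people"
assert decode(encode(msg, alphabet), alphabet) == msg
```

A good predictor makes the guess counts mostly 1s and 2s, which is what makes the code compressible.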
Arithmetic coding
Arithmetic Coding (background)
- Refines the guessing game:
  - In the guessing game the predictor chooses an order over the alphabet.
  - In arithmetic coding the predictor chooses a distribution over the alphabet.
- First discovered by Elias (MIT).
- Invented independently by Rissanen and Pasco in 1976.
- Widely used in practice.
Arithmetic coding
Arithmetic Coding (basic idea)
- Easier notation: represent characters by numbers 1 ≤ c_t ≤ |Σ|. (English: N = |Σ| = 26)
- The message-prefix c_1, c_2, …, c_{t−1} is represented by a line segment [l_{t−1}, u_{t−1}).
- Initial segment: [l_0, u_0) = [0, 1).
- After observing c_1, c_2, …, c_{t−1}, the predictor outputs
  p(c_t = 1 | c_1, …, c_{t−1}), …, p(c_t = |Σ| | c_1, …, c_{t−1}).
- This distribution is used to partition [l_{t−1}, u_{t−1}) into |Σ| sub-segments.
- The next character c_t determines [l_t, u_t).
- Code = a discriminating binary expansion of a point in [l_T, u_T).
Arithmetic coding
Arithmetic Coding (sequence example)
- Simplest case: Σ = {0, 1}.
- For all t: p_t(c_t = 0) = 1/3, p_t(c_t = 1) = 2/3.
- Message = 1111, Code = 111.
- Technical: assume the decoder knows that the length of the message is 4.

[Figure: the interval [0,1) is repeatedly split at ratio 1/3 : 2/3; each observed 1 keeps the upper sub-segment, narrowing the interval toward 1.]
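The interval arithmetic on this slide can be checked directly. This is a minimal sketch, not a production arithmetic coder; like the slide, it assumes the decoder knows the message length, and the function names `interval` and `shortest_code` are introduced here for illustration.

```python
from fractions import Fraction
import math

def interval(msg, p0):
    """Narrow [l, u) bit by bit: a 0 keeps the lower p0-fraction of the
    current segment, a 1 keeps the rest."""
    l, u = Fraction(0), Fraction(1)
    for c in msg:
        split = l + (u - l) * p0
        l, u = (l, split) if c == "0" else (split, u)
    return l, u

def shortest_code(l, u):
    """Shortest binary expansion x = k / 2^m that lands in [l, u)."""
    m = 1
    while True:
        k = math.ceil(l * 2 ** m)        # smallest m-bit point >= l
        if Fraction(k, 2 ** m) < u:
            return format(k, f"0{m}b")
        m += 1

# The slide's example: p(0) = 1/3, message 1111 -> code 111
l, u = interval("1111", Fraction(1, 3))
print(shortest_code(l, u))               # -> 111
```

The final interval is [65/81, 1), and 0.111 in binary is 7/8 = 0.875, which indeed falls inside it; no shorter expansion does.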
The performance of arithmetic coding
The code length for arithmetic coding
- Given m bits of binary expansion, we assume the rest are all zero.
- The distance between two m-bit expansions is 2^{−m}.
- If u_T − l_T ≥ 2^{−m} then there must be a point x described by m expansion bits such that l_T ≤ x < u_T.
- The required number of bits is ⌈−log_2(u_T − l_T)⌉.
- u_T − l_T = ∏_{t=1}^T p(c_t | c_1, c_2, …, c_{t−1}) = p(c_1, …, c_T)
- The number of bits required to code c_1, c_2, …, c_T is ⌈−∑_{t=1}^T log_2 p_t(c_t)⌉.
- We call −∑_{t=1}^T log_2 p_t(c_t) = −log_2 p(c_1, …, c_T) the cumulative log loss.
- This holds for all sequences.
Source entropy
Expected code length
- Fix the message length T.
- Suppose the message is generated at random according to the distribution p(c_1, …, c_T).
- Then the expected code length is

  ∑_{c_1,…,c_T} p(c_1, …, c_T) ⌈−log_2 p(c_1, …, c_T)⌉
    ≤ 1 − ∑_{c_1,…,c_T} p(c_1, …, c_T) log_2 p(c_1, …, c_T)
    = 1 + H(p_T)

- H(p_T) is the entropy of the distribution over sequences of length T:

  H(p_T) = ∑_{(c_1,…,c_T)} p(c_1, …, c_T) log_2 (1 / p(c_1, …, c_T))

- Entropy is the expected value of the cumulative log loss when the messages are generated according to the predictive distribution.
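The last point is easy to verify numerically. A small sanity check for an IID binary source with p(1) = 2/3 (as in the earlier arithmetic-coding example): the entropy of the length-T distribution equals T times the per-character entropy, and equals the expected cumulative log loss of the predictor that uses the true probabilities.

```python
from itertools import product
from math import log2, isclose

p1 = 2 / 3      # probability of a 1, IID source
T = 4

def prob(seq):
    """Probability of a length-T binary sequence under the IID source."""
    ones = sum(seq)
    return p1 ** ones * (1 - p1) ** (T - ones)

# Entropy of the distribution over length-T sequences
H_T = sum(prob(s) * -log2(prob(s)) for s in product([0, 1], repeat=T))

# Expected cumulative log loss of the predictor using the true p1
expected_log_loss = sum(
    prob(s) * sum(-log2(p1 if c else 1 - p1) for c in s)
    for s in product([0, 1], repeat=T)
)

per_char = -(p1 * log2(p1) + (1 - p1) * log2(1 - p1))
assert isclose(H_T, T * per_char)
assert isclose(H_T, expected_log_loss)
```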
Source entropy
Shannon’s lower bound
- Assume p_T is “well behaved”, for example IID.
- Let T → ∞.
- H(p) = lim_{T→∞} H(p_T)/T exists and is called the per-character entropy of the source p.
- The expected code length for any coding scheme is at least
  (1 − o(1)) H(p_T) = (1 − o(1)) T H(p)
- The proof of Shannon’s lower bound is not trivial (can be a student lecture).
Other properties of log loss
Unbiased prediction
Log loss encourages unbiased prediction
- Suppose the source is random and the probability of the next outcome is p(c_t | c_1, c_2, …, c_{t−1}).
- Then the prediction that minimizes the expected log loss is p(c_t | c_1, c_2, …, c_{t−1}) itself.
- Note that when minimizing the expected number of mistakes, the best prediction in this situation is to put all of the probability on the most likely outcome.
- There are other losses with this property, for example the square loss.
Other properties of log loss
Other examples for using log loss
Monthly bonuses for a weather forecaster
- Before the first of the month, assign one dollar to the forecaster’s bonus: b_0 = 1.
- The forecaster assigns probability p_t to rain on day t.
- If it rains on day t then b_t = 2 b_{t−1} p_t.
- If it does not rain on day t then b_t = 2 b_{t−1} (1 − p_t).
- At the end of the month, give the forecaster b_T.
- Risk-averse strategy: setting p_t = 1/2 for all days guarantees b_T = 1.
- High-risk prediction: setting p_t ∈ {0, 1} results in bonus b_T = 2^T if always correct, zero otherwise.
- If the forecaster predicts with the true probabilities then E(log b_T) = T − H(p_T), and that is the maximal expected value of E(log b_T).
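The bonus scheme is easy to simulate. A sketch; the rain outcomes and probability sequences below are made up for illustration.

```python
def bonus(probs, rain):
    """b_0 = 1; each day the bonus is doubled and then multiplied by
    the probability the forecaster put on what actually happened."""
    b = 1.0
    for p, r in zip(probs, rain):
        b *= 2 * (p if r else 1 - p)
    return b

rain = [True, False, True, True, False]

# Risk-averse: p_t = 1/2 every day guarantees b_T = 1.
assert bonus([0.5] * 5, rain) == 1.0

# High risk: p_t in {0, 1} gives 2^T if always right, 0 if ever wrong.
assert bonus([1.0 if r else 0.0 for r in rain], rain) == 2.0 ** 5
assert bonus([0.0, 0.0, 1.0, 1.0, 0.0], rain) == 0.0
```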
Other properties of log loss
Other examples for using log loss
Horse-race betting
- You go to the horse races with one dollar: b_0 = 1.
- m horses compete in each race.
- Before each race, the odds for each horse are announced: o_t(1), …, o_t(m) (arbitrary positive numbers).
- You have to divide all your money among the different horses: ∑_{j=1}^m p̂_t(j) = 1.
- The horse 1 ≤ y_t ≤ m is the winner of the t-th race.
- After iteration t, you have b_t = b_{t−1} p̂_t(y_t) o_t(y_t) dollars.
- After n races, you have b_n = ∏_{t=1}^n p̂_t(y_t) o_t(y_t) dollars.
- Taking logs, we get a cumulative log loss.
Universal coding

“Universal” coding
- Suppose there are N alternative predictors / experts.
- We would like to code almost as well as the best predictor.
- We would like to make almost as much money as the best expert in hindsight.
Universal coding

Two part codes

Two part codes
- Send the index of the coding algorithm before the message.
- Requires log_2 N additional bits.
- Requires the encoder to make two passes over the data.
- This is the key idea of MDL (Minimum Description Length) modeling.
- Good prediction model = the model that minimizes the total code length.
- Often inappropriate because it is based on lossless coding; lossy coding is often more appropriate.
Universal coding

Combining expert advice for cumulative log loss

Combining predictors adaptively
- Treat each of the predictors as an “expert”.
- Assign a weight to each expert and reduce it if the expert performs poorly.
- Combine the expert predictions according to their weights.
- Requires only a single pass: truly online.
- Goal: the total loss of the algorithm minus the loss of the best predictor should be at most log_2 N.
Combining experts in the log loss framework
The log-loss framework
- Algorithm A predicts a sequence c_1, c_2, …, c_T over the alphabet Σ = {1, 2, …, k}.
- The prediction for the t-th character is a distribution over Σ:

  p_A^t = 〈p_A^t(1), p_A^t(2), …, p_A^t(k)〉

- When c_t is revealed, the loss we suffer is −log p_A^t(c_t).
- The cumulative log loss, which we wish to minimize, is

  L_A^T = −∑_{t=1}^T log p_A^t(c_t)

- ⌈L_A^T⌉ is the code length if A is combined with arithmetic coding.
Combining experts in the log loss framework
The game
- Prediction algorithm A has access to N experts.
- The following is repeated for t = 1, …, T:
  - The experts generate predictive distributions p_1^t, …, p_N^t.
  - The algorithm generates its own prediction p_A^t.
  - c_t is revealed.
- Goal: minimize the regret

  −∑_{t=1}^T log p_A^t(c_t) − min_{i=1,…,N} ( −∑_{t=1}^T log p_i^t(c_t) )
The online Bayes Algorithm
The online Bayes Algorithm
- Total loss of expert i:

  L_i^t = −∑_{s=1}^t log p_i^s(c_s);   L_i^0 = 0

- Weight of expert i:

  w_i^t = w_i^1 e^{−L_i^{t−1}} = w_i^1 ∏_{s=1}^{t−1} p_i^s(c_s)

- Freedom to choose the initial weights: w_i^1 ≥ 0, ∑_{i=1}^N w_i^1 = 1.
- Prediction of algorithm A:

  p_A^t = ∑_{i=1}^N w_i^t p_i^t / ∑_{i=1}^N w_i^t
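The algorithm above fits in a few lines of code. A minimal sketch for a finite alphabet; `expert_preds` is a made-up input format chosen for illustration.

```python
from math import log

def online_bayes(expert_preds, outcomes):
    """expert_preds[t][i] is expert i's distribution (a list of
    probabilities over the alphabet) at time t; outcomes[t] is the
    revealed character. Returns the algorithm's cumulative log loss."""
    N = len(expert_preds[0])
    w = [1.0 / N] * N                  # uniform initial weights
    loss = 0.0
    for preds, c in zip(expert_preds, outcomes):
        W = sum(w)
        # Prediction: the weighted mixture of the experts' distributions
        p = sum(w[i] * preds[i][c] for i in range(N)) / W
        loss += -log(p)
        # Bayes update: w_i <- w_i * p_i^t(c_t)
        w = [w[i] * preds[i][c] for i in range(N)]
    return loss
```

Note that the unnormalized weight w_i^{t+1} is exactly w_i^1 times the probability expert i assigned to the sequence so far, matching the product formula on the slide.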
The online Bayes Algorithm
The Hedge(η) Algorithm
Consider action i at time t.
- Total loss:

  L_i^t = ∑_{s=1}^{t−1} ℓ_i^s

- Weight:

  w_i^t = w_i^1 e^{−η L_i^t}

  Note the freedom to choose the initial weights w_i^1, with ∑_{i=1}^N w_i^1 = 1.
- η > 0 is the learning rate parameter. Halving: η → ∞.
- Probability:

  p_i^t = w_i^t / ∑_{j=1}^N w_j^t,   p^t = w^t / ∑_{j=1}^N w_j^t
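Hedge(η) in code. A sketch with made-up loss vectors; the only differences from online Bayes are that the update uses arbitrary losses ℓ_i^s scaled by η, and that the algorithm's loss is the expectation under p^t rather than a log loss.

```python
from math import exp

def hedge(eta, loss_vectors, N):
    """Hedge(eta): maintain w_i^t = w_i^1 * exp(-eta * L_i^t) and play
    the normalized distribution p^t. Returns the cumulative expected
    loss sum_t <p^t, l^t>."""
    w = [1.0 / N] * N                  # uniform initial weights
    total = 0.0
    for losses in loss_vectors:
        W = sum(w)
        p = [wi / W for wi in w]       # p_i^t = w_i^t / sum_j w_j^t
        total += sum(pi * li for pi, li in zip(p, losses))
        w = [wi * exp(-eta * li) for wi, li in zip(w, losses)]
    return total
```

With a large η the weights concentrate quickly on the expert with the smallest total loss, approaching the halving behavior mentioned above.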
The performance bound
Cumulative loss vs. final total weight
Total weight: W^t = ∑_{i=1}^N w_i^t

  W^{t+1} / W^t = ∑_{i=1}^N w_i^t e^{log p_i^t(c_t)} / ∑_{i=1}^N w_i^t
               = ∑_{i=1}^N w_i^t p_i^t(c_t) / ∑_{i=1}^N w_i^t
               = p_A^t(c_t)

  −log (W^{t+1} / W^t) = −log p_A^t(c_t)

  −log W^{T+1} = −log (W^{T+1} / W^1) = −∑_{t=1}^T log p_A^t(c_t) = L_A^T

EQUALITY, not a bound!
The performance bound
Simple Bound
- Use uniform initial weights w_i^1 = 1/N.
- The total weight is at least the weight of the best expert:

  L_A^T = −log W^{T+1} = −log ∑_{i=1}^N w_i^{T+1}
        = −log ∑_{i=1}^N (1/N) e^{−L_i^T} = log N − log ∑_{i=1}^N e^{−L_i^T}
        ≤ log N − log max_i e^{−L_i^T} = log N + min_i L_i^T

- Dividing by T we get L_A^T / T ≤ min_i L_i^T / T + (log N) / T.
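The bound L_A^T ≤ log N + min_i L_i^T can be checked numerically. A self-contained sketch; the experts (IID Bernoulli predictors with random biases) and the outcome sequence are invented for the test.

```python
from math import log
import random

random.seed(0)
N, T = 5, 40

# Made-up experts: each is an IID Bernoulli predictor with its own bias.
biases = [random.uniform(0.1, 0.9) for _ in range(N)]
outcomes = [random.randint(0, 1) for _ in range(T)]

def expert_loss(b):
    """Cumulative log loss of the predictor with constant bias b."""
    return sum(-log(b if c else 1 - b) for c in outcomes)

# Online Bayes with uniform initial weights.
w = [1.0 / N] * N
L_A = 0.0
for c in outcomes:
    W = sum(w)
    p = sum(wi * (b if c else 1 - b) for wi, b in zip(w, biases)) / W
    L_A += -log(p)
    w = [wi * (b if c else 1 - b) for wi, b in zip(w, biases)]

assert L_A <= log(N) + min(expert_loss(b) for b in biases) + 1e-9
```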
The performance bound
Upper bound on ∑_{i=1}^N w_i^{T+1} for Hedge(η)

Lemma (upper bound)
For any sequence of loss vectors ℓ^1, …, ℓ^T we have

  ln ( ∑_{i=1}^N w_i^{T+1} ) ≤ −(1 − e^{−η}) L_{Hedge(η)}.
The performance bound
Tuning η as a function of T
- Trivially min_i L_i ≤ T, yielding

  L_{Hedge(η)} ≤ min_i L_i + √(2 T ln N) + ln N

- Per iteration we get:

  L_{Hedge(η)} / T ≤ min_i L_i / T + √(2 ln N / T) + (ln N) / T
The performance bound
Bound better than for two part codes
- The simple bound is as good as the bound for two part codes (MDL), but enables online compression.
- Suppose we have K copies of each expert.
- A two part code has to point to one of the NK experts:

  L_A ≤ log NK + min_i L_i^T

- If we use the Bayes predictor + arithmetic coding we get:

  L_A = −log W^{T+1} ≤ −log ( K max_i (1/(NK)) e^{−L_i^T} ) = log N + min_i L_i^T

- We don’t pay a penalty for the copies.
- More generally, the regret is smaller if many of the experts perform well.
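The no-penalty-for-copies claim can also be verified numerically: with uniform initial weights, duplicating every expert K times leaves the Bayes predictor's mixture, and hence its cumulative log loss, unchanged. A sketch with invented experts and outcomes:

```python
from math import log, isclose

def bayes_loss(biases, outcomes):
    """Cumulative log loss of the online Bayes mixture over IID
    Bernoulli experts with the given biases, uniform initial weights."""
    w = [1.0 / len(biases)] * len(biases)
    L = 0.0
    for c in outcomes:
        W = sum(w)
        L += -log(sum(wi * (b if c else 1 - b)
                      for wi, b in zip(w, biases)) / W)
        w = [wi * (b if c else 1 - b) for wi, b in zip(w, biases)]
    return L

outcomes = [1, 1, 0, 1, 0, 1, 1, 1]
biases = [0.2, 0.5, 0.8]          # N = 3 distinct experts
K = 4
copies = biases * K               # NK = 12 experts, K copies of each

# Duplicating experts changes nothing for the Bayes predictor:
assert isclose(bayes_loss(biases, outcomes), bayes_loss(copies, outcomes))
```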