Lossless compression and cumulative log loss
Yoav Freund
January 13, 2014
Outline
- Lossless data compression
  - The guessing game
  - Arithmetic coding
  - The performance of arithmetic coding
  - Source entropy
- Other properties of log loss
  - Unbiased prediction
  - Other examples for using log loss
- Universal coding
  - Two part codes
  - Combining expert advice for cumulative log loss
- Combining experts in the log loss framework
  - The online Bayes Algorithm
  - The performance bound
Lossless data compression
The source compression problem
- Example: "There are no people like show people"
  encode → x ∈ {0,1}^n → decode → "there are no people like show people"
- Lossless: the message is reconstructed perfectly.
- Goal: minimize the expected length E(n) of the coded message.
- Can we do better than ⌈log_2(26)⌉ = 5 bits per character?
- Basic idea: use short codes for common messages.
- Stream compression:
  - The message is revealed one character at a time.
  - The code is generated as the message is revealed.
  - The decoded message is constructed gradually.
  - Easier than block codes when processing long messages.
  - A natural way of describing a distribution.
The guessing game

The guessing game
- The message is revealed one character at a time.
- An algorithm predicts the next character from the revealed part of the message.
- If the algorithm is wrong, ask for its next guess.
- Example ("there are no pe…", with ␣ marking a space):

  character:  t  h  e  r  e  ␣  a  r  e  ␣  n  o  ␣  p  e
  guesses:    6  2  1  2  1  1  5  2  1  1  4  1  1  5  3

- Code = the sequence of numbers of guesses.
- To decode, run the same prediction algorithm.
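The guessing game can be sketched in code. The predictor below is a hypothetical one, not the one used on the slide: it ranks characters by their frequency in the revealed prefix, breaking ties alphabetically. Any deterministic ranking works, as long as the encoder and the decoder share it.

```python
def ranking(prefix, alphabet):
    """Rank characters: most frequent in the revealed prefix first,
    ties broken alphabetically (an illustrative choice)."""
    counts = {ch: prefix.count(ch) for ch in alphabet}
    return sorted(alphabet, key=lambda ch: (-counts[ch], ch))

def encode(msg, alphabet):
    """Code = for each character, the number of guesses needed."""
    code = []
    for t, ch in enumerate(msg):
        order = ranking(msg[:t], alphabet)
        code.append(order.index(ch) + 1)   # 1 = first guess correct
    return code

def decode(code, alphabet):
    """Run the same predictor to invert the code."""
    msg = ""
    for guesses in code:
        order = ranking(msg, alphabet)
        msg += order[guesses - 1]
    return msg

alphabet = "abcdefghijklmnopqrstuvwxyz "
msg = "there are no people like show people"
assert decode(encode(msg, alphabet), alphabet) == msg
```

A good predictor makes the guess counts mostly 1s and 2s, which is what makes the code compressible.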
Arithmetic coding
Arithmetic Coding (background)
- Refines the guessing game:
  - In the guessing game the predictor chooses an order over the alphabet.
  - In arithmetic coding the predictor chooses a distribution over the alphabet.
- First discovered by Elias (MIT).
- Invented independently by Rissanen and Pasco in 1976.
- Widely used in practice.
Arithmetic coding
Arithmetic Coding (basic idea)
- Easier notation: represent characters by numbers 1 ≤ c_t ≤ |Σ|. (English: N = |Σ| = 26)
- The message-prefix c_1, c_2, …, c_{t−1} is represented by a line segment [l_{t−1}, u_{t−1}).
- Initial segment: [l_0, u_0) = [0, 1).
- After observing c_1, c_2, …, c_{t−1}, the predictor outputs
  p(c_t = 1 | c_1, …, c_{t−1}), …, p(c_t = |Σ| | c_1, …, c_{t−1}).
- This distribution is used to partition [l_{t−1}, u_{t−1}) into |Σ| sub-segments.
- The next character c_t determines [l_t, u_t).
- Code = a discriminating binary expansion of a point in [l_T, u_T).
Arithmetic coding
Arithmetic Coding (sequence example)
- Simplest case: Σ = {0, 1}.
- For all t: p_t(c_t = 0) = 1/3, p_t(c_t = 1) = 2/3.
- Message = 1111, Code = 111.
- Technical: assume the decoder knows that the length of the message is 4.

[Figure: the interval [0,1) is repeatedly split at ratio 1/3 : 2/3; each observed 1 keeps the upper sub-segment, narrowing the interval toward 1.]
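The interval arithmetic on this slide can be checked directly. This is a minimal sketch, not a production arithmetic coder; like the slide, it assumes the decoder knows the message length, and the function names `interval` and `shortest_code` are introduced here for illustration.

```python
from fractions import Fraction
import math

def interval(msg, p0):
    """Narrow [l, u) bit by bit: a 0 keeps the lower p0-fraction of the
    current segment, a 1 keeps the rest."""
    l, u = Fraction(0), Fraction(1)
    for c in msg:
        split = l + (u - l) * p0
        l, u = (l, split) if c == "0" else (split, u)
    return l, u

def shortest_code(l, u):
    """Shortest binary expansion x = k / 2^m that lands in [l, u)."""
    m = 1
    while True:
        k = math.ceil(l * 2 ** m)        # smallest m-bit point >= l
        if Fraction(k, 2 ** m) < u:
            return format(k, f"0{m}b")
        m += 1

# The slide's example: p(0) = 1/3, message 1111 -> code 111
l, u = interval("1111", Fraction(1, 3))
print(shortest_code(l, u))               # -> 111
```

The final interval is [65/81, 1), and 0.111 in binary is 7/8 = 0.875, which indeed falls inside it; no shorter expansion does.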
The performance of arithmetic coding
The code length for arithmetic coding
- Given m bits of binary expansion, we assume the rest are all zero.
- The distance between two m-bit expansions is 2^{−m}.
- If u_T − l_T ≥ 2^{−m} then there must be a point x described by m expansion bits such that l_T ≤ x < u_T.
- The required number of bits is ⌈−log_2(u_T − l_T)⌉.
- u_T − l_T = ∏_{t=1}^T p(c_t | c_1, c_2, …, c_{t−1}) = p(c_1, …, c_T)
- The number of bits required to code c_1, c_2, …, c_T is ⌈−∑_{t=1}^T log_2 p_t(c_t)⌉.
- We call −∑_{t=1}^T log_2 p_t(c_t) = −log_2 p(c_1, …, c_T) the cumulative log loss.
- This holds for all sequences.
Source entropy
Expected code length
- Fix the message length T.
- Suppose the message is generated at random according to the distribution p(c_1, …, c_T).
- Then the expected code length is

  ∑_{c_1,…,c_T} p(c_1, …, c_T) ⌈−log_2 p(c_1, …, c_T)⌉
    ≤ 1 − ∑_{c_1,…,c_T} p(c_1, …, c_T) log_2 p(c_1, …, c_T)
    = 1 + H(p_T)

- H(p_T) is the entropy of the distribution over sequences of length T:

  H(p_T) = ∑_{(c_1,…,c_T)} p(c_1, …, c_T) log_2 (1 / p(c_1, …, c_T))

- Entropy is the expected value of the cumulative log loss when the messages are generated according to the predictive distribution.
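The last point is easy to verify numerically. A small sanity check for an IID binary source with p(1) = 2/3 (as in the earlier arithmetic-coding example): the entropy of the length-T distribution equals T times the per-character entropy, and equals the expected cumulative log loss of the predictor that uses the true probabilities.

```python
from itertools import product
from math import log2, isclose

p1 = 2 / 3      # probability of a 1, IID source
T = 4

def prob(seq):
    """Probability of a length-T binary sequence under the IID source."""
    ones = sum(seq)
    return p1 ** ones * (1 - p1) ** (T - ones)

# Entropy of the distribution over length-T sequences
H_T = sum(prob(s) * -log2(prob(s)) for s in product([0, 1], repeat=T))

# Expected cumulative log loss of the predictor using the true p1
expected_log_loss = sum(
    prob(s) * sum(-log2(p1 if c else 1 - p1) for c in s)
    for s in product([0, 1], repeat=T)
)

per_char = -(p1 * log2(p1) + (1 - p1) * log2(1 - p1))
assert isclose(H_T, T * per_char)
assert isclose(H_T, expected_log_loss)
```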
Source entropy
Shannon’s lower bound
- Assume p_T is “well behaved”, for example IID.
- Let T → ∞.
- H(p) = lim_{T→∞} H(p_T)/T exists and is called the per-character entropy of the source p.
- The expected code length for any coding scheme is at least
  (1 − o(1)) H(p_T) = (1 − o(1)) T H(p)
- The proof of Shannon’s lower bound is not trivial (can be a student lecture).
Other properties of log loss
Unbiased prediction
Log loss encourages unbiased prediction
- Suppose the source is random and the probability of the next outcome is p(c_t | c_1, c_2, …, c_{t−1}).
- Then the prediction that minimizes the expected log loss is p(c_t | c_1, c_2, …, c_{t−1}) itself.
- Note that when minimizing the expected number of mistakes, the best prediction in this situation is to put all of the probability on the most likely outcome.
- There are other losses with this property, for example the square loss.
Other properties of log loss
Other examples for using log loss
Monthly bonuses for a weather forecaster
- Before the first of the month, assign one dollar to the forecaster’s bonus: b_0 = 1.
- The forecaster assigns probability p_t to rain on day t.
- If it rains on day t then b_t = 2 b_{t−1} p_t.
- If it does not rain on day t then b_t = 2 b_{t−1} (1 − p_t).
- At the end of the month, give the forecaster b_T.
- Risk-averse strategy: setting p_t = 1/2 for all days guarantees b_T = 1.
- High-risk prediction: setting p_t ∈ {0, 1} results in bonus b_T = 2^T if always correct, zero otherwise.
- If the forecaster predicts with the true probabilities then E(log b_T) = T − H(p_T), and that is the maximal expected value of E(log b_T).
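The bonus scheme is easy to simulate. A sketch; the rain outcomes and probability sequences below are made up for illustration.

```python
def bonus(probs, rain):
    """b_0 = 1; each day the bonus is doubled and then multiplied by
    the probability the forecaster put on what actually happened."""
    b = 1.0
    for p, r in zip(probs, rain):
        b *= 2 * (p if r else 1 - p)
    return b

rain = [True, False, True, True, False]

# Risk-averse: p_t = 1/2 every day guarantees b_T = 1.
assert bonus([0.5] * 5, rain) == 1.0

# High risk: p_t in {0, 1} gives 2^T if always right, 0 if ever wrong.
assert bonus([1.0 if r else 0.0 for r in rain], rain) == 2.0 ** 5
assert bonus([0.0, 0.0, 1.0, 1.0, 0.0], rain) == 0.0
```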
Other properties of log loss
Other examples for using log loss
Horse-race betting
- You go to the horse races with one dollar: b_0 = 1.
- m horses compete in each race.
- Before each race, the odds for each horse are announced: o_t(1), …, o_t(m) (arbitrary positive numbers).
- You have to divide all your money among the different horses: ∑_{j=1}^m p̂_t(j) = 1.
- The horse 1 ≤ y_t ≤ m is the winner of the t-th race.
- After iteration t, you have b_t = b_{t−1} p̂_t(y_t) o_t(y_t) dollars.
- After n races, you have b_n = ∏_{t=1}^n p̂_t(y_t) o_t(y_t) dollars.
- Taking logs, we get a cumulative log loss.
Universal coding

“Universal” coding
- Suppose there are N alternative predictors / experts.
- We would like to code almost as well as the best predictor.
- We would like to make almost as much money as the best expert in hindsight.
Universal coding

Two part codes

Two part codes
- Send the index of the coding algorithm before the message.
- Requires log_2 N additional bits.
- Requires the encoder to make two passes over the data.
- This is the key idea of MDL (Minimum Description Length) modeling.
- Good prediction model = the model that minimizes the total code length.
- Often inappropriate because it is based on lossless coding; lossy coding is often more appropriate.
Universal coding

Combining expert advice for cumulative log loss

Combining predictors adaptively
- Treat each of the predictors as an “expert”.
- Assign a weight to each expert and reduce it if the expert performs poorly.
- Combine the expert predictions according to their weights.
- Requires only a single pass: truly online.
- Goal: the total loss of the algorithm minus the loss of the best predictor should be at most log_2 N.
Combining experts in the log loss framework
The log-loss framework
- Algorithm A predicts a sequence c_1, c_2, …, c_T over the alphabet Σ = {1, 2, …, k}.
- The prediction for the t-th character is a distribution over Σ:

  p_A^t = 〈p_A^t(1), p_A^t(2), …, p_A^t(k)〉

- When c_t is revealed, the loss we suffer is −log p_A^t(c_t).
- The cumulative log loss, which we wish to minimize, is

  L_A^T = −∑_{t=1}^T log p_A^t(c_t)

- ⌈L_A^T⌉ is the code length if A is combined with arithmetic coding.
Combining experts in the log loss framework
The game
- Prediction algorithm A has access to N experts.
- The following is repeated for t = 1, …, T:
  - The experts generate predictive distributions p_1^t, …, p_N^t.
  - The algorithm generates its own prediction p_A^t.
  - c_t is revealed.
- Goal: minimize the regret

  −∑_{t=1}^T log p_A^t(c_t) − min_{i=1,…,N} ( −∑_{t=1}^T log p_i^t(c_t) )
The online Bayes Algorithm
The online Bayes Algorithm
- Total loss of expert i:

  L_i^t = −∑_{s=1}^t log p_i^s(c_s);   L_i^0 = 0

- Weight of expert i:

  w_i^t = w_i^1 e^{−L_i^{t−1}} = w_i^1 ∏_{s=1}^{t−1} p_i^s(c_s)

- Freedom to choose the initial weights: w_i^1 ≥ 0, ∑_{i=1}^N w_i^1 = 1.
- Prediction of algorithm A:

  p_A^t = ∑_{i=1}^N w_i^t p_i^t / ∑_{i=1}^N w_i^t
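The algorithm above fits in a few lines of code. A minimal sketch for a finite alphabet; `expert_preds` is a made-up input format chosen for illustration.

```python
from math import log

def online_bayes(expert_preds, outcomes):
    """expert_preds[t][i] is expert i's distribution (a list of
    probabilities over the alphabet) at time t; outcomes[t] is the
    revealed character. Returns the algorithm's cumulative log loss."""
    N = len(expert_preds[0])
    w = [1.0 / N] * N                  # uniform initial weights
    loss = 0.0
    for preds, c in zip(expert_preds, outcomes):
        W = sum(w)
        # Prediction: the weighted mixture of the experts' distributions
        p = sum(w[i] * preds[i][c] for i in range(N)) / W
        loss += -log(p)
        # Bayes update: w_i <- w_i * p_i^t(c_t)
        w = [w[i] * preds[i][c] for i in range(N)]
    return loss
```

Note that the unnormalized weight w_i^{t+1} is exactly w_i^1 times the probability expert i assigned to the sequence so far, matching the product formula on the slide.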
The online Bayes Algorithm
The Hedge(η) Algorithm
Consider action i at time t.
- Total loss:

  L_i^t = ∑_{s=1}^{t−1} ℓ_i^s

- Weight:

  w_i^t = w_i^1 e^{−η L_i^t}

  Note the freedom to choose the initial weights w_i^1, with ∑_{i=1}^N w_i^1 = 1.
- η > 0 is the learning rate parameter. Halving: η → ∞.
- Probability:

  p_i^t = w_i^t / ∑_{j=1}^N w_j^t,   p^t = w^t / ∑_{j=1}^N w_j^t
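Hedge(η) in code. A sketch with made-up loss vectors; the only differences from online Bayes are that the update uses arbitrary losses ℓ_i^s scaled by η, and that the algorithm's loss is the expectation under p^t rather than a log loss.

```python
from math import exp

def hedge(eta, loss_vectors, N):
    """Hedge(eta): maintain w_i^t = w_i^1 * exp(-eta * L_i^t) and play
    the normalized distribution p^t. Returns the cumulative expected
    loss sum_t <p^t, l^t>."""
    w = [1.0 / N] * N                  # uniform initial weights
    total = 0.0
    for losses in loss_vectors:
        W = sum(w)
        p = [wi / W for wi in w]       # p_i^t = w_i^t / sum_j w_j^t
        total += sum(pi * li for pi, li in zip(p, losses))
        w = [wi * exp(-eta * li) for wi, li in zip(w, losses)]
    return total
```

With a large η the weights concentrate quickly on the expert with the smallest total loss, approaching the halving behavior mentioned above.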
The performance bound
Cumulative loss vs. final total weight
Total weight: W^t = ∑_{i=1}^N w_i^t

  W^{t+1} / W^t = ∑_{i=1}^N w_i^t e^{log p_i^t(c_t)} / ∑_{i=1}^N w_i^t
               = ∑_{i=1}^N w_i^t p_i^t(c_t) / ∑_{i=1}^N w_i^t
               = p_A^t(c_t)

  −log (W^{t+1} / W^t) = −log p_A^t(c_t)

  −log W^{T+1} = −log (W^{T+1} / W^1) = −∑_{t=1}^T log p_A^t(c_t) = L_A^T

EQUALITY, not a bound!
The performance bound
Simple Bound
- Use uniform initial weights w_i^1 = 1/N.
- The total weight is at least the weight of the best expert:

  L_A^T = −log W^{T+1} = −log ∑_{i=1}^N w_i^{T+1}
        = −log ∑_{i=1}^N (1/N) e^{−L_i^T} = log N − log ∑_{i=1}^N e^{−L_i^T}
        ≤ log N − log max_i e^{−L_i^T} = log N + min_i L_i^T

- Dividing by T we get L_A^T / T ≤ min_i L_i^T / T + (log N) / T.
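The bound L_A^T ≤ log N + min_i L_i^T can be checked numerically. A self-contained sketch; the experts (IID Bernoulli predictors with random biases) and the outcome sequence are invented for the test.

```python
from math import log
import random

random.seed(0)
N, T = 5, 40

# Made-up experts: each is an IID Bernoulli predictor with its own bias.
biases = [random.uniform(0.1, 0.9) for _ in range(N)]
outcomes = [random.randint(0, 1) for _ in range(T)]

def expert_loss(b):
    """Cumulative log loss of the predictor with constant bias b."""
    return sum(-log(b if c else 1 - b) for c in outcomes)

# Online Bayes with uniform initial weights.
w = [1.0 / N] * N
L_A = 0.0
for c in outcomes:
    W = sum(w)
    p = sum(wi * (b if c else 1 - b) for wi, b in zip(w, biases)) / W
    L_A += -log(p)
    w = [wi * (b if c else 1 - b) for wi, b in zip(w, biases)]

assert L_A <= log(N) + min(expert_loss(b) for b in biases) + 1e-9
```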
The performance bound
Upper bound on ∑_{i=1}^N w_i^{T+1} for Hedge(η)

Lemma (upper bound)
For any sequence of loss vectors ℓ^1, …, ℓ^T we have

  ln ( ∑_{i=1}^N w_i^{T+1} ) ≤ −(1 − e^{−η}) L_{Hedge(η)}.
The performance bound
Tuning η as a function of T
- Trivially min_i L_i ≤ T, yielding

  L_{Hedge(η)} ≤ min_i L_i + √(2 T ln N) + ln N

- Per iteration we get:

  L_{Hedge(η)} / T ≤ min_i L_i / T + √(2 ln N / T) + (ln N) / T
The performance bound
Bound better than for two part codes
- The simple bound is as good as the bound for two part codes (MDL), but enables online compression.
- Suppose we have K copies of each expert.
- A two part code has to point to one of the NK experts:

  L_A ≤ log NK + min_i L_i^T

- If we use the Bayes predictor + arithmetic coding we get:

  L_A = −log W^{T+1} ≤ −log ( K max_i (1/(NK)) e^{−L_i^T} ) = log N + min_i L_i^T

- We don’t pay a penalty for the copies.
- More generally, the regret is smaller if many of the experts perform well.
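The no-penalty-for-copies claim can also be verified numerically: with uniform initial weights, duplicating every expert K times leaves the Bayes predictor's mixture, and hence its cumulative log loss, unchanged. A sketch with invented experts and outcomes:

```python
from math import log, isclose

def bayes_loss(biases, outcomes):
    """Cumulative log loss of the online Bayes mixture over IID
    Bernoulli experts with the given biases, uniform initial weights."""
    w = [1.0 / len(biases)] * len(biases)
    L = 0.0
    for c in outcomes:
        W = sum(w)
        L += -log(sum(wi * (b if c else 1 - b)
                      for wi, b in zip(w, biases)) / W)
        w = [wi * (b if c else 1 - b) for wi, b in zip(w, biases)]
    return L

outcomes = [1, 1, 0, 1, 0, 1, 1, 1]
biases = [0.2, 0.5, 0.8]          # N = 3 distinct experts
K = 4
copies = biases * K               # NK = 12 experts, K copies of each

# Duplicating experts changes nothing for the Bayes predictor:
assert isclose(bayes_loss(biases, outcomes), bayes_loss(copies, outcomes))
```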