
Parametric context adaptive Laplace distribution for multimedia compression

Jarek Duda, Jagiellonian University, Golebia 24, 31-007 Krakow, Poland, Email: [email protected]

Abstract—Data compression often subtracts a prediction and encodes the difference (residue), e.g. assuming a Laplace distribution, for example for images, videos, audio, or numerical data. Its performance strongly depends on the proper choice of width (scale parameter) of this parametric distribution, and can be improved by optimizing it based on the local situation, e.g. the context. For example, the popular LOCO-I [1] (JPEG-LS) lossless image compressor uses a 3-dimensional context quantized into 365 discrete possibilities treated independently. This article discusses inexpensive approaches for exploiting their dependencies with autoregressive ARCH-like context dependent models for the parameters of the parametric distribution of the residue, also evolving in time in the adaptive case. For example, the tested 4 and 11 parameter models of this kind turned out to provide performance similar to the 365 parameter LOCO-I model for the 48 tested images. Beside smaller headers, such a reduction of the number of parameters can lead to better generalization. In contrast to context quantization approaches, parameterized models also allow directly using higher dimensional contexts, for example using information from all 3 color channels, further pixels, some additional region classifiers, or from interleaving multi-scale scanning - for which a Haar upscale scan is proposed, combining the advantages of Haar wavelets with the possibility of scanning that exploits local contexts.

Keywords: data compression, LOCO-I, parametric distribution, context dependence, non-stationary time series, multi-scale scanning

    I. INTRODUCTION

Many types of data statistically agree with specific parametric distributions, like the Gaussian distribution through the law of large numbers, or the Laplace distribution popular in data compression as it agrees with the statistics of errors from prediction (residues). Their parameters can often be inexpensively estimated, and storing them in a header is much less expensive than e.g. an entire probability distribution on some quantized set of represented values. Parametric distributions smoothen between discretized possibilities, generalizing statistical trends emerging in a given type of data.

However, for example due to randomness alone, the statistics of real data usually have some distortion from such an idealization. Directly storing the counted frequencies can exploit this difference, gaining asymptotically the Kullback-Leibler divergence in bits/value, at the cost of a larger header. Data compressors need to optimize this minimum description length [2] tradeoff between the model size and the entropy it leads to.

Figure 1. Comparison of some discussed models for 48 grayscale 8 bit 512x512 images presented in Fig. 2. Top left: first we need to predict the pixel value based on the current context: the already decoded 4 neighboring pixels c = (A, B, C, D). This predicted µ(c) is used as the center of the Laplace distribution and is estimated as the median, which minimizes the l1 distance. Hence the presented evaluation uses the average |x − µ(c)| for 4 approaches: LOCO-I predictor (red), simple average (green), least squares parameters for combined images (orange), and least squares parameters chosen individually for each image (blue) - the last one gives the lowest residues, so it is used further. Top right: bits/pixel for encoding the residues (r = x − µ(c)) using a centered (µ = 0) Laplace distribution of width (scale parameter) b modeled in various ways. Red: LOCO-I model with 365 parameters corresponding to the quantized context (|C − A|, |B − C|, |D − B|). Green: a single b chosen individually (MLE) for each image. Orange: the 4 parameter model discussed here, written at the bottom left; blue: the 11 parameter model discussed later. Bottom: differences of these values for the two models. The evaluation assumes accurate entropy coding (AC/ANS) and neglects headers - including them would worsen especially the LOCO-I evaluation if storing all 365 parameters.

In practice, instead of a single e.g. Laplace distribution to encode residues (errors of predictions) for the entire image, we would like to make its parameters dependent on the local situation - through context dependence as in Markov modelling, or adaptivity as for non-stationary time series.

The possibility to directly store all values fades away when increasing the dimension of the model - both due to the size growing exponentially with dimension and due to underrepresentation. Going to higher dimensions requires finding and exploiting some general behaviour, for example through parametrizations, as in the examples presented in Fig. 1.

LOCO-I [1] mixes both philosophies: it uses parametric probability distributions whose scale parameter (width of the Laplace distribution) depends on a 3-dimensional context quantized into 365 possibilities treated independently - neglecting their dependencies. Such an approach is useful for low dimensional contexts; however, it becomes impractical for higher dimensional contexts, e.g. using information from all 3 color channels, pixels beyond the nearest neighbors, or some region classifiers to gradually transit between e.g. models for smooth regions like sky and complex textures like a treetop. Finally, contexts of much higher dimension appear in multi-scale interlaced scanning as in the FLIF [3] compressor, progressively improving quality - essentially only parametric models can directly work with such high dimensional contexts.

This article discusses such parametric-parametric models: choosing the parameters of e.g. a Laplace distribution as a parametric function of the context, like a linear combination, or more generally e.g. a neural network. An example is the ARMA-ARCH [4] family of models popular in economics: choosing the squared width of a Gaussian distribution as a linear combination of recent squared residues, e.g. $\sigma_t^2 = \beta_0 + \beta_1 \epsilon_{t-1}^2$.

These parameters can be universal, e.g. defaults for various types of classified regions, or optimized individually by the compressor and stored in the header. For the latter purpose we will focus on least squares estimation due to its low cost. The presented test results use such estimation; a costlier additional optimization might slightly improve performance.

While we will mostly focus on such static models, assuming a constant joint distribution of (value, context), the mentioned alternative is adaptive models: assuming a non-stationary time series with an evolving joint distribution. This requires an additional cost to update the parameters of the model, for example performing a likelihood optimization step while processing each value. It has two advantages: it can learn the model from already decoded data even without a header, and it can flexibly adapt to local behavior e.g. of an image. The Appendix discusses second order approaches for such online optimization.

The literature also considers much more costly models, like using neural networks for predicting the probability distribution of succeeding pixels ([5], [6]). In the discussed philosophy, instead of directly predicting the probability of each discrete value e.g. with softmax, we can use such neural networks to directly predict context dependent parameters of some parametric distribution for the new pixel. Such a simplification should allow using much smaller neural networks, bringing them closer to practical application in data compression.

    II. PARAMETRIC-PARAMETRIC DISTRIBUTIONS

We would like to model the conditional probability distribution Pr(x|c) of the new value x ∈ R, based on some local d-dimensional context $c = (c_1, \ldots, c_d) \in C \subset \mathbb{R}^d$, in practice bounded e.g. to a cube like $C = [0, 1]^d$ here. In the LOCO-I image compressor this context is the 4 neighboring already decoded pixels (c = (A, B, C, D) as in Fig. 1).

Figure 2. Dataset of 48 grayscale 8 bit 512x512 images used in tests. Source: http://decsai.ugr.es/cvg/CG/base.htm .

Both value and context are rather discrete through some quantization, but it is useful to model them as real values - especially when wanting to exploit the continuity of their behavior.

Modelling general continuous conditional distributions is a difficult task - it requires techniques like quantile regression [7] or hierarchical correlation reconstruction [8], [9]. However, the situation becomes much simpler if focusing on simple parametric distributions for the predicted distribution. Another standard simplification is separately modelling the center of the distribution with a predictor µ(c), and the remaining parameter(s) θ(c) of the centered distribution for the residue r = x − µ(c), usually a single scale parameter defining the width:

$r = x - \mu(c)$ residue from $\rho_{\theta(c)}$ density $\quad$ (1)

We will mainly focus on the Laplace distribution, standard for such applications, and model its width parameter b:

$\rho_{\mu b}(x) = \frac{1}{2b}\exp\left(\frac{-|x-\mu|}{b}\right) \qquad \rho_b(r) = \frac{1}{2b}\exp\left(\frac{-|r|}{b}\right) \quad$ (2)

whose MLE parameters for a sample $(x_1, \ldots, x_n)$ are:

$\mu = \text{median of } \{x_i\} \qquad b = \frac{1}{n}\sum_{i=1}^{n} |x_i - \mu| \quad$ (3)
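These estimators are simple enough to state directly; below is a minimal numerical sketch (in Python with numpy; the function name and the synthetic sample are only illustrative):

```python
import numpy as np

def laplace_mle(x):
    """MLE of Laplace parameters (3): mu is the sample median, b the mean absolute deviation from mu."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)               # median minimizes sum |x_i - mu|
    b = np.mean(np.abs(x - mu))     # b = (1/n) sum |x_i - mu|
    return mu, b

# Example on a synthetic Laplace sample:
rng = np.random.default_rng(0)
print(laplace_mle(rng.laplace(loc=3.0, scale=2.0, size=10000)))   # roughly (3.0, 2.0)
```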



LOCO-I uses a fixed specialized predictor. It then chooses the width parameter θ(c) ≡ b(c) as locally constant inside 365 regions obtained by quantizing the (|C − A|, |B − C|, |D − B|) context, each coordinate into 9 ranges of nearly equal population. This way estimation can be performed independently for each region, finally e.g. storing the 365 parameters in the header.

Quantization of the context neglects dependencies between these regions and can be practical rather only for low dimensional contexts - both due to the number of possibilities growing exponentially with dimension and due to underrepresentation of many such contexts. To resolve this, we will focus here on parameterized models for these parameters:

$\mu(c) \equiv \mu_\alpha(c)$ for $\alpha \in \mathbb{R}^{d_\alpha}$ (predictor)

$\theta(c) \equiv \theta_\beta(c)$ for $\beta \in \mathbb{R}^{d_\beta}$ (e.g. scale parameter)

Choosing the $\mu_\alpha(c)$ and $\theta_\beta(c)$ families of functions optimized for a given type of problems is a difficult question. Like ARCH, and unlike LOCO-I, we will focus on linear combinations of some chosen functions f, g:

$\mu_\alpha(c) = \alpha_1 f_1(c) + \alpha_2 f_2(c) + \ldots + \alpha_{d_\alpha} f_{d_\alpha}(c) \quad$ (4)

$\theta_\beta(c) = \beta_1 g_1(c) + \beta_2 g_2(c) + \ldots + \beta_{d_\beta} g_{d_\beta}(c) \quad$ (5)

The latter might need an additional clamp, e.g. max(θ, 0.001), if positive values are required and some of the β are negative. Alternatively we can use more sophisticated nonlinear models like neural networks.
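For illustration only (not the paper's implementation), the decoder side of such a model just evaluates the stored linear combinations for each context; the feature functions and the floor value below are placeholder assumptions:

```python
def evaluate_model(context, alpha, beta, f_funcs, g_funcs, floor=0.001):
    """Evaluate mu_alpha(c) and theta_beta(c) of (4)-(5) as linear combinations of chosen
    feature functions, clamping theta as in max(theta, 0.001) if it could go negative."""
    mu = sum(a * f(context) for a, f in zip(alpha, f_funcs))
    theta = sum(b * g(context) for b, g in zip(beta, g_funcs))
    return mu, max(theta, floor)

# Toy 2-dimensional context c = (c1, c2) with arbitrary illustrative features:
f_funcs = [lambda c: c[0], lambda c: c[1]]
g_funcs = [lambda c: 1.0, lambda c: abs(c[0] - c[1])]
print(evaluate_model((0.4, 0.7), [0.5, 0.5], [0.02, 0.1], f_funcs, g_funcs))
```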

    A. Context dependence

Having chosen some family of functions for $\mu_\alpha(c)$ and $\theta_\beta(c)$, we can optimize α, β (or e.g. neural network parameters) for given values $(x_1, \ldots, x_n)$ and contexts $(c_1, \ldots, c_n)$, for example maximizing likelihood (MLE):

$(\alpha, \beta) = \arg\max_{\alpha, \beta} \sum_{i=1}^{n} \log\left(\rho_{\theta_\beta(c_i)}(x_i - \mu_\alpha(c_i))\right) \quad$ (6)

To simplify this optimization at the cost of suboptimality, we can split it into the predictor and the remaining parameters, as in Fig. 1.

This way we can first optimize the parameters of the predictor, e.g. using some distance d:

$\alpha = \arg\min_\alpha \sum_{i=1}^{n} d(x_i, \mu_\alpha(c_i)) \quad$ (7)

For example, using the least squares distance d(x, y) = (x − y)² we are looking for a predictor of the expected value - appropriate e.g. for Gaussian distribution (or polynomial coefficients in [9]). For Laplace distribution it is more appropriate to use d(x, y) = |x − y|, giving a predictor of the median. However, except for heavy tail cases, both optimizations give nearly the same predictor, so it is safe to use least squares optimization, which is computationally less expensive.

Having optimized the predictor, we can calculate the residues $r_i = x_i - \mu_\alpha(c_i)$ and separately optimize β using them.

Especially for a scale parameter, the MLE estimator is often an average of some simple function of the values, for example b = average |r| for Laplace distribution (θ ≡ b), σ² = average r² for Gaussian distribution (θ ≡ σ²), or generally average |r|^κ for the exponential power distribution (θ ≡ b^κ). The average is an estimator of the expected value, which allows for practical optimization of β using least squares (analogously e.g. for neural networks):

$\beta = \arg\min_\beta \sum_{i=1}^{n} \left(|r_i| - \theta_\beta(c_i)\right)^2$ for Laplace: $\theta \equiv b \quad$ (8)

$\beta = \arg\min_\beta \sum_{i=1}^{n} \left((r_i)^2 - \theta_\beta(c_i)\right)^2$ for Gaussian: $\theta \equiv \sigma^2$

Such parameters can be optimized for a dataset, for example for different regions using some segmentation, and then used as defaults. Alternatively, the compressor can optimize them individually, e.g. for a given image, and store the parameters in the header.
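A rough sketch of this two-stage least squares fit, (7) followed by (8) for a Laplace width (assuming numpy; the feature-function lists are application-dependent placeholders):

```python
import numpy as np

def fit_two_stage(xs, contexts, f_funcs, g_funcs):
    """First fit predictor parameters alpha by least squares (7), then fit beta by
    regressing theta_beta(c) on |r| (8), since b = E|r| for the Laplace distribution."""
    F = np.array([[f(c) for f in f_funcs] for c in contexts])     # n x d_alpha feature matrix
    x = np.asarray(xs, dtype=float)
    alpha, *_ = np.linalg.lstsq(F, x, rcond=None)
    r = x - F @ alpha                                             # residues
    G = np.array([[g(c) for g in g_funcs] for c in contexts])     # n x d_beta feature matrix
    beta, *_ = np.linalg.lstsq(G, np.abs(r), rcond=None)
    return alpha, beta, r
```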

    B. Adaptivity

Instead of storing model parameters in the header, an alternative approach is starting from some default parameters and adapting them based on the processed data, also for better agreement with varying local statistics e.g. of an image. Such adaptation brings an additional cost; dependence on the local situation can alternatively be realized by using some region classifier/segmentation and separate models for each class, or by using the outcome of such a local classifier as additional context - choosing the best tradeoffs is a difficult question.

For adaptation we can treat the upper index as time and use time dependent parameters, starting from some e.g. default initial choice for t = 0. For example, without context dependence, we could just replace the average with an exponential moving average for Laplace distribution and some learning rates η, ν ∈ (0, 1):

$\mu^{t+1} = \nu\mu^t + (1-\nu)x^t \qquad b^{t+1} = \eta b^t + (1-\eta)|x^t - \mu^t|$

Generally we could use for example gradient descent while processing each value to optimize the parameters toward local statistics, for combined (α, β) using (6), or in split form:

$r^t = x^t - \mu_{\alpha^t}(c^t)$ residue from $\rho_{\theta_{\beta^t}}$ density

$\alpha^{t+1} = \alpha^t - \eta_\alpha \frac{\partial d(x^t, \mu_\alpha(c^t))}{\partial \alpha}(\alpha^t) \qquad \beta^{t+1} = \beta^t + \eta_\beta \frac{\partial \log(\rho_{\theta_\beta(c^t)}(r^t))}{\partial \beta}(\beta^t) \quad$ (9)

where d is a distance as previously. For β the above gradient ascent optimizes likelihood; η_α, η_β define the adaptation rate.

A first order method alone is not sufficient for a proper choice of step size, suggesting to also use the second derivative and Newton's method (e.g. $\forall_i\ \theta_i^{t+1} = \theta_i^t - \partial_i f(\theta^t) / \partial_{ii} f(\theta^t)$) - the Appendix discusses such general approaches.
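As a hedged sketch of the β update in (9) for a Laplace width modeled as a linear combination $b(c) = \beta \cdot g(c)$; the learning rate, floor, and feature vector below are illustrative assumptions:

```python
import numpy as np

def adapt_laplace_width(beta, g_of_c, r, eta_beta=0.01, floor=0.001):
    """One gradient-ascent step of (9) on log-likelihood with respect to beta.
    For rho_b(r) = exp(-|r|/b)/(2b): d log rho / d b = |r|/b^2 - 1/b,
    and with b(c) = beta . g(c) the chain rule gives d b / d beta_j = g_j(c)."""
    g_of_c = np.asarray(g_of_c, dtype=float)
    b = max(float(np.dot(beta, g_of_c)), floor)       # clamp width away from zero
    grad = (abs(r) / b**2 - 1.0 / b) * g_of_c
    return np.asarray(beta, dtype=float) + eta_beta * grad
```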


    C. Exponential power distribution

Data compression usually focuses on Laplace distribution, but real data might have slightly different statistics, especially heavier tails. It might be worth considering more general families, especially the exponential power distribution [10]:

$\rho_{\kappa\mu b}(x) = \frac{\kappa^{-1/\kappa}}{2\, b\, \Gamma(1 + 1/\kappa)}\, e^{-\frac{1}{\kappa}\left(\frac{|x-\mu|}{b}\right)^{\kappa}} \quad$ (10)

It covers both the Laplace (κ = 1) and Gaussian (κ = 2, b ≡ σ) distributions. Estimating κ is costly, but we can fix it based on a large dataset and e.g. segment type. Then estimation of µ, b is analogous, also for context dependence as in (8):

$\mu = \arg\min_\mu \sum_{i=1}^{n} |x_i - \mu|^\kappa \qquad b = \left(\frac{1}{n}\sum_{i=1}^{n} |x_i - \mu|^\kappa\right)^{1/\kappa}$

$\beta = \arg\min_\beta \sum_{i=1}^{n} \left(|r_i|^\kappa - \theta_\beta(c_i)\right)^2$ for $\theta \equiv b^\kappa \quad$ (11)

    Here is a simple example of its adaptive estimation for η, ν ∈(0, 1) learning rates:

$\mu^{t+1} = \nu\,\mu^t + (1-\nu)\,x^t$

$\theta^{t+1} = \eta\,\theta^t + (1-\eta)\,|x^t - \mu^t|^\kappa$ for $\theta \equiv b^\kappa \quad$ (12)

In data compression we can have prepared entropy coding tables for such a fixed κ and some optimized discretized set of the scale parameter b.
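A minimal sketch of the adaptive estimation (12); κ and the learning rates below are illustrative values, not tuned ones:

```python
def epd_adaptive_step(x_t, mu, theta, kappa=1.5, nu=0.99, eta=0.99):
    """One step of (12) for the exponential power distribution: mu tracks the center,
    theta tracks b^kappa as an exponential moving average of |x - mu|^kappa."""
    mu_new = nu * mu + (1.0 - nu) * x_t
    theta_new = eta * theta + (1.0 - eta) * abs(x_t - mu) ** kappa
    b = theta_new ** (1.0 / kappa)        # recover the scale parameter when needed
    return mu_new, theta_new, b
```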

    D. Adaptive least-squares linear regression∗

∗This subsection expands adaptivity to linear regression - for completeness and to connect some concepts; however, it might be too costly for data compression and is not used further (yet).

The above formula (12) for b can be seen as obtained from online adaptive ML estimation: instead of the standard "static" estimation of constant parameters based on the entire sample, we perform ML estimation separately for every moment in time - using only its past information, weakening the influence of old values e.g. with an exponential moving average. This way we optimize parameters separately for every time, instead of the standard approach of finding a single compromise for all of them.

Specifically, we get the formula (12) for b if maximizing $l_T = \sum_t \ldots$


    III. PRACTICAL LAPLACE EXAMPLE AND EXPERIMENTS

Let us now focus on the LOCO-I lossless image compression setting: the context is the 4 already decoded neighboring pixels c = (A, B, C, D) at the (left, up, left-up, right-up) positions respectively, as in the diagram in Fig. 1.

    A. Predictor µ(c)

    LOCO-I uses a fixed predictor (c = (A,B,C,D)):

$\mu(c) = \begin{cases} \min(A,B) & \text{if } C \ge \max(A,B)\\ \max(A,B) & \text{if } C \le \min(A,B)\\ A+B-C & \text{otherwise} \end{cases} \quad$ (19)

Simpler popular choices are e.g. (A + B)/2 or A + B − C. A standard way of designing such predictors is polynomial interpolation, e.g. in the Lorenzo predictor [11]: fitting some polynomial to the known values and calculating its value at the predicted position, getting a linear combination.
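A direct transcription of the predictor (19), together with the two simpler alternatives, as a small sketch (D is unused by (19) and kept only for a uniform interface):

```python
def loco_i_predict(A, B, C, D):
    """LOCO-I fixed predictor (19): a median edge detector switching between
    min/max of (A, B) depending on C, otherwise the planar prediction A + B - C."""
    if C >= max(A, B):
        return min(A, B)
    if C <= min(A, B):
        return max(A, B)
    return A + B - C

def average_predict(A, B, C, D):
    return (A + B) / 2      # simple popular alternative

def planar_predict(A, B, C, D):
    return A + B - C        # another popular alternative
```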

We can also directly optimize it for a dataset. For example, least squares optimization using the combined 48 images (Fig. 2) gives (rounded to 2 digits, the weights sum to 1):

    µ(c) = 0.57A+ 0.48B − 0.2C + 0.15D

Alternatively, we can optimize these weights individually for each image in the compressor and store them in the header - Fig. 1 contains a comparison of the various approaches using the l1 distance, as we would like to estimate the median for Laplace distribution. Such individual least squares optimization turns out to be superior in all cases there (blue points); for some images the LOCO-I predictor is much worse than the remaining approaches.

The tested inexpensive least squares optimizer directly uses the $d_\alpha = 4$ functions $f_1(c) = A$, $f_2(c) = B$, $f_3(c) = C$, $f_4(c) = D$ in the notation of (4). We build the $n \times d_\alpha$ matrix P from them: $P_{ij} = f_j(c_i)$, and the vector $x = (x_1, \ldots, x_n)$. Then the optimal parameters are obtained using the pseudo-inverse (as derived in (15) for equal weights w):

$\alpha = \arg\min_\alpha \|P\alpha - x\|_2^2 = (P^\dagger P)^{-1} P^\dagger x \quad$ (20)

Further tests use the residues from individual least squares optimization for each image: $r = x - P\alpha$.
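A sketch of this optimization for one grayscale image (assuming numpy and a 2D array of pixel values; border pixels are simply skipped here, which is an assumption rather than the paper's exact handling):

```python
import numpy as np

def fit_ls_predictor(img):
    """Build P with one (A, B, C, D) row per pixel and solve (20),
    here via lstsq rather than the explicit pseudo-inverse for numerical safety."""
    rows, targets = [], []
    H, W = img.shape
    for y in range(1, H):
        for x in range(1, W - 1):
            A, B = img[y, x - 1], img[y - 1, x]            # left, up
            C, D = img[y - 1, x - 1], img[y - 1, x + 1]    # left-up, right-up
            rows.append((A, B, C, D))
            targets.append(img[y, x])
    P, xvec = np.array(rows, dtype=float), np.array(targets, dtype=float)
    alpha, *_ = np.linalg.lstsq(P, xvec, rcond=None)
    return alpha, xvec - P @ alpha, rows                   # weights, residues r = x - P alpha, contexts
```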

    B. Context dependent scale parameter b(c)

Having the residues, LOCO-I would divide each of |C − A|, |B − C|, |D − B| into 9 ranges of nearly equal population. Including symmetry, this leads to a division into (9³ + 1)/2 = 365 contexts. For each of them the scale parameter b of Laplace distribution is estimated independently.

Here we would like to model b as a linear combination (5) of some functions $(g_j(c))_{j=1..d_\beta}$ of the context. The choice of these functions is difficult and essentially affects compression ratios. They should contain "1" for the intercept term.

Figure 3. Top: probability density of the b parameters for all images, for LOCO-I and the discussed 4 parameter model, assuming the models are estimated and stored individually for each image. The three most characteristic images are marked with their numbers. Bottom left: such densities when combining all images into one - while the huge number of LOCO-I parameters can usually learn individual images better than the 4 parameter model, it has worse generalization - it is inferior when combining different types of patterns. Bottom right: penalty of using power-of-2 Golomb coding for various b parameters. We can get a ≈ 2% improvement by switching to arithmetic coding or asymmetric numeral systems; however, especially for LOCO-I it would require larger headers due to the needed better precision of b.

Then, in analogy to LOCO-I, the considered 4 parameter model uses the following linear combination (for convenience enumerated from 0):

$b(c) = \beta_0 + \beta_1|C-A|^{0.8} + \beta_2|B-C|^{0.8} + \beta_3|D-B|^{0.8} \quad$ (21)

There is a freedom in choosing the above power, and empirically ≈ 0.8 has turned out to provide the best likelihood/compression ratio - it corresponds well to linear behavior of b. With this choice all the β coefficients turn out positive in experiments - we have some initial width β_0, growing with increased gradients in the neighboring pixels. Hence there is no possibility of getting a negative b this way, which would make no sense.

Having chosen such e.g. $d_\beta = 4$ functions, we build the $n \times d_\beta$ matrix $S_{ij} = g_j(c_i)$ from them, and the residue vector $|r| = (|r_1|, \ldots, |r_n|)$. Then we can use least squares optimization:

$\beta = \arg\min_\beta \|S\beta - |r|\|_2^2 = (S^\dagger S)^{-1} S^\dagger |r| \quad$ (22)
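A sketch of fitting the 4 parameter model (21)-(22), reusing (A, B, C, D) contexts and residues gathered e.g. as in the earlier predictor sketch; the power 0.8 follows (21), and the clamp max(b, 0.001) mirrors the one discussed in the text:

```python
import numpy as np

def fit_width_model(contexts, residues, power=0.8):
    """Least squares fit (22) of b(c) = beta0 + beta1|C-A|^p + beta2|B-C|^p + beta3|D-B|^p
    against |r|, since b is estimated as the average of |r| for the Laplace distribution."""
    S = np.array([[1.0,
                   abs(C - A) ** power,
                   abs(B - C) ** power,
                   abs(D - B) ** power] for (A, B, C, D) in contexts])
    beta, *_ = np.linalg.lstsq(S, np.abs(np.asarray(residues, dtype=float)), rcond=None)
    predicted_b = np.maximum(S @ beta, 0.001)    # clamp, needed e.g. for the 11 parameter model
    return beta, predicted_b
```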

Figure 3 contains a comparison of the densities of predicted scale parameters b for individual images (top) for the LOCO-I approach and the above 4 parameter model - the latter is smoother, as we could expect, but generally they have similar behavior. The bottom left of this figure contains a comparison when combining all images, and compression ratios showing better generalization of these low parameter models.

The second considered model, with $d_\beta = 11$ parameters, extends the above basis by the following arbitrarily chosen 7 functions: symmetric ones describing the intensity of neighboring pixels, and ones evaluating the second derivative:

$(A-0.5)^4,\ (B-0.5)^4,\ (C-0.5)^4,\ (D-0.5)^4,$

$|C-2B+D|^{0.1},\ |A-2C+B|^{0.1} \quad$ (23)


where again the powers were chosen empirically to get the best likelihood/compression ratio. In contrast to the 4 parameter model, this time we also get negative β coefficients, leading to negative predicted b. To prevent that, max(b, 0.001) was finally used as the width of Laplace distribution.

The used functions were chosen arbitrarily by manual optimization; some wider systematic search should improve performance. For example, in practical implementations the above power functions would rather be put into tables, which allows using much more complex functions, e.g. given by stored values on some quantized set of arguments. This would allow carefully optimizing such tabled functions based on a large set of images.

The above was for Laplace distribution. For the more general exponential power distribution, $|r|^\kappa$ should be used in (22) instead of $|r|$, and the prediction $S\beta$ as in (21) gives $b^\kappa$.

    C. Entropy coding, penalty of Golomb coding

Laplace distribution is continuous; to encode values from it we need to quantize it to an approximately geometric distribution, whose values are transformed into bits using some entropy coding.

LOCO-I uses power-of-2 Golomb coding: instead of the real b coefficient, it optimizes the parameter $M = 2^m$; then x is stored as $\lfloor x/M \rfloor$ using unary coding, and $\mathrm{mod}(x, M)$ is stored directly as m bits. This way it requires $\lfloor x/M \rfloor + 1 + m$ bits to store an unsigned x. Signed values are stored as the position in the order $0, 1, -1, 2, -2, \ldots$
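A small sketch of this coding (the zigzag mapping and Rice code below follow the standard definitions; bit strings are returned only for illustration):

```python
def zigzag(s):
    """Map a signed residue to its position in the 0, 1, -1, 2, -2, ... order."""
    return 2 * s - 1 if s > 0 else -2 * s

def rice_bits(u, m):
    """Bit cost of power-of-2 Golomb (Rice) coding of unsigned u with M = 2^m:
    quotient floor(u/M) in unary (plus its terminating bit) and m remainder bits."""
    return (u >> m) + 1 + m

def rice_encode(u, m):
    """The code itself: unary quotient '1...10', then the remainder as m bits."""
    q, r = u >> m, u & ((1 << m) - 1)
    return "1" * q + "0" + (format(r, "0" + str(m) + "b") if m > 0 else "")

# Example: residue -3 with m = 2 (M = 4) maps to u = 6 and a 4-bit code.
print(rice_bits(zigzag(-3), 2), rice_encode(zigzag(-3), 2))
```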

Ideally, a symbol of probability p carries $\log_2(1/p)$ bits of information, leading asymptotically to Shannon entropy bits/symbol. Optimally parameterized power-of-two Golomb coding is worse by a few percent for the b values used here, as shown in Fig. 3. One reason is the sparse $M = 2^m$ quantization of the parameter. More important, especially for small b, is that most of the probability goes to the 0 quantized value, which can correspond to less than 1 bit of informational content; in contrast, prefix codes like Golomb need to use at least 1 bit per symbol.

Replacing power-of-2 Golomb coding with an accurate entropy coder like arithmetic coding (AC) or asymmetric numeral systems (ANS), we can improve the compression ratio by ≈ 2%. In this case we also need some quantization of the b parameter - we can have prepared entropy coding tables for some discretized space of possible parameters.
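The penalty shown in Fig. 3 (bottom right) can be approximated with a computation along the following lines; here b is measured in units of the integer quantization step and the residue range is truncated, so the printed numbers only indicate the effect rather than reproduce the figure:

```python
import numpy as np

def golomb_penalty_percent(b):
    """Compare expected bits of optimal power-of-2 Golomb (Rice) coding with the Shannon
    entropy of integer residues s having p(s) proportional to exp(-|s|/b) (quantized Laplace)."""
    s_max = int(60 * b) + 60                       # tails beyond this carry negligible mass
    s = np.arange(-s_max, s_max + 1)
    p = np.exp(-np.abs(s) / b)
    p /= p.sum()
    nz = p > 0
    entropy = -np.sum(p[nz] * np.log2(p[nz]))      # ideal bits/symbol (AC/ANS)
    u = np.where(s > 0, 2 * s - 1, -2 * s)         # zigzag to unsigned
    best_rice = min(np.sum(p * ((u >> m) + 1 + m)) for m in range(16))
    return 100.0 * (best_rice - entropy) / entropy

for b in (0.5, 1.0, 2.0, 5.0, 20.0):
    print(b, round(golomb_penalty_percent(b), 2))
```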

    D. Multi-scale interleaving

In standard scanning line by line we have context only from half of the plane, and can only guess what will happen on the not yet decoded side. This can be improved with multi-scale interleaving, showing gains e.g. in the FLIF [3] compressor, where we can use lower resolution context from all directions thanks to progressive decoding in multiple scans, as visualized in Fig. 4.

However, we can see that the context information becomes much more complex here: high dimensional and varying with the scan number. Even when reducing it by some arbitrary averaging, it is still rather too large for context quantization approaches like in LOCO-I. The parametric approaches discussed here have no problem with the direct use of such high dimensional contexts, modelling parameters as e.g. a linear combination of a chosen family of functions, with parameters chosen e.g. by inexpensive least squares optimization and stored in the header. Alternatively, more complex models can be used instead, like neural networks.

This figure also proposes a combination with Haar wavelets for hopefully improved performance - splitting decoding into k cycles, each improving the resolution twice, and each composed of a few scans, e.g. 3 for grayscale or 9 for 3 colors - each scan providing a single degree of freedom per block for the upscaling. Such a decomposition into e.g. 9 scans clearly leaves an opportunity for optimization, starting with the choice of color transformation.

Assuming some scale invariance of images, similar models can be used for different cycles here; for example, we can treat the cycle number (defining the scale) as an additional parameter.

Figure 4. Top: conventional multi-scale interleaved scanning [5] (e.g. the FLIF compressor [3]): scan over succeeding sub-lattices for progressive decoding, and most importantly to provide better local context for later decoded pixels. Bottom: proposed Haar upsample scanning, which combines advantages of Haar wavelets [12] with exploitation of local context dependence. First (scan 0) we decode a low resolution image: averages over $2^k \times 2^k$ size blocks, using decoded neighboring block averages as the context. Then in each cycle (scans 1, 2, 3) we decode the 3 missing values (for grayscale, 9 for RGB) to improve the resolution twice: e.g. horizontal differences in scan 1, then vertical differences in two positions in scans 2 and 3. After k such cycles we reach $1 \times 1$ blocks - a completely decoded image. The context of already decoded local information is high dimensional and of a different type for each scan and level. While this is a problem for LOCO-I like context quantization, parametric models can easily handle it, for example using a $\mu_s(c) = \sum_i \alpha_i^s c_i$ predictor, where s denotes the type of scan - its parameters α can be inexpensively e.g. MSE optimized and stored in the header. Some modification options are e.g. splitting values into higher and lower bits for separate scans [6], or using fractal-like (tame twindragon) blocks by modifying translation vectors for a hexagonal block lattice [13].
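To make the per-block arithmetic of one upscaling cycle concrete, below is a small sketch for a single grayscale 2x2 block; the particular assignment of the three details to scans 1-3 is one possible realization consistent with the caption, not necessarily the exact transform intended in the paper:

```python
import numpy as np

def haar_decompose_block(block):
    """Split a 2x2 block into (average, 3 detail values): the horizontal difference of
    column averages (scan 1) and the vertical differences within each column (scans 2, 3)."""
    a, b = block[0, 0], block[0, 1]
    c, d = block[1, 0], block[1, 1]
    left, right = (a + c) / 2, (b + d) / 2
    return (left + right) / 2, (left - right, a - c, b - d)

def haar_reconstruct_block(mean, details):
    """Inverse: recover the 2x2 block from its average and the 3 decoded details."""
    h, v1, v2 = details
    left, right = mean + h / 2, mean - h / 2
    return np.array([[left + v1 / 2, right + v2 / 2],
                     [left - v1 / 2, right - v2 / 2]])

block = np.array([[10.0, 14.0], [6.0, 18.0]])
m, det = haar_decompose_block(block)
assert np.allclose(haar_reconstruct_block(m, det), block)   # lossless round trip
```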


    IV. CONCLUSION AND FURTHER WORK

Parametric models allow successfully exploiting trends in behavior, also for context dependence and evolution of parametric distributions. Thanks to generalization, a model with a few parameters can provide better performance than treating all possibilities as independent - neglecting dependencies between them. When wanting to exploit higher dimensional contexts, e.g. for 3 colors, further pixels, region classifiers or multi-scale scanning, parametric models become a necessity, as the number of discretized possibilities would grow exponentially with dimension.

Very basic possibilities were presented and tested here, leaving many improvement opportunities, starting with the choice of contexts and functions, or using other parametric distributions like the exponential power distribution. The used least squares optimization is inexpensive enough for the compressor to individually optimize parameters for each image. For example, when choosing some general default parameters, we can use better optimizers, like l1 for the Laplace median, or generally MLE. These parameters can alternatively be optimized online, e.g. with the discussed adaptive linear regression; however, it might be too costly for data compression.

Lossy image compressors have a different situation: they encode e.g. DCT transform coefficients, for which the distribution parameters should be chosen also based on position - which should be included as a part of the context with some properly chosen functions.

As we can see in Fig. 3, there is a large spread of behavior of the parameters, and using individual models for separate images often gives an improvement. This suggests trying to segment the image into regions of similar behavior, or to use a region classifier. Having such a segmentation mechanism optimized for a large dataset, with separate models for each segment, these could define default behavior, avoiding the need for separate model estimation and storage. It would be valuable to optimize such segmentation based on the used family of models. An alternative approach is using classifiers and treating their evaluation as part of the context, which would additionally allow continuous interpolation between classes.

Finally, while for low cost reasons we focused on linear models for the parameters, better compression ratios at a larger computational cost should be achievable using more general models like neural networks. They are considered in the literature for directly predicting discrete probability distributions of pixel values ([5], [6]). We could reduce the computational cost if, based on the context, we predict only parameters of parametric distributions instead, finally discretizing the obtained distribution - for example a Laplace distribution for unimodal cases, e.g. training the neural network to minimize the sum of |x − µ(c)| and (b(c) − |x − µ(c)|)². For more complex, e.g. multimodal distributions, we can parameterize the density as a polynomial and train to minimize the sum of squares of differences for coefficients of orthonormal polynomials as in [9].

    APPENDIX

The appendix discusses some general approaches for online adaptation of model parameters to optimize for the local behavior of a non-stationary time series, using an updated second order local approximation of the optimized function.

For this purpose we should define a function we want to optimize. This function needs to evolve in time to express the local situation we would like to optimize for. Recently observed values bring us information about this local situation - we can for example use their log-likelihood to estimate the evolution of parameters of a probability distribution, reducing the weights of old observations to find the current local behavior. It is convenient to use exponentially weakening weights as in an exponential moving average, leading e.g. to the adaptive estimation (12).

1) Second order online parameter optimization: Let us choose an example of such a family of online optimized evolving criteria for a series of observations $(x^t)$ - for time T + 1 as:

$F^{T+1}(\theta) = \sum_{t \le T} \eta^{T-t} f(x^t, \theta) = \eta F^T(\theta) + f(x^T, \theta) \quad$ (24)

using some coefficient η ∈ (0, 1) and a point-wise evaluation f, e.g. the logarithm of density for log-likelihood. In practice this sum is finite - requiring to choose some initial value for the above recurrence of the exponential moving average.

In machine learning such objective/cost functions F(θ) are usually minimized, so we can use minus the logarithm to optimize log-likelihood, or some approximation of it like the first few terms of the Taylor expansion:

$f(x, \theta) = -\ln(\rho_\theta(x)) \quad \left(= \sum_{k=1}^{\infty} \frac{(1-\rho_\theta(x))^k}{k}\right)$

Now for online minimization of F, a natural assumption is that at time T + 1 we know $\theta^T$ minimizing $F^T$, and want to find $\theta^{T+1}$ minimizing $F^{T+1}$. To reduce cost, we would like to slowly evolve the parameters ($\theta^{T+1} \approx \theta^T$), which generally requires caution: it might be suboptimal if F is not convex.

To approximate such a preferable step $\theta^{T+1} - \theta^T$, we can use derivatives calculated with a recurrence as in (24), e.g.:

$\partial_{\theta_i} F^{T+1} = \eta\, \partial_{\theta_i} F^T + \partial_{\theta_i} f(x^T, \theta)$

However, to work on values we would need to fix a point ($\theta^*$) where such derivatives are taken. We could choose this point as some averaged parameters, and use values perturbed from it based on the first two derivatives calculated online at this point, e.g. using Newton's method - multidimensional, or separately for each coordinate i:

$\forall_i \quad \theta_i^T = \theta_i^* - \partial_{\theta_i} F^T(\theta^*)\, /\, \partial_{\theta_i \theta_i} F^T(\theta^*) \quad$ (25)

For a more general evolution of parameters we need to be able to shift this point of derivation, for example by updating the model at two points simultaneously: using the older one to get the actual parameters (25), and periodically replacing this used point with the new one, starting to build a new model at a recent $\theta^* = \theta^T$ point for the derivatives.
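A per-coordinate sketch of this scheme under simplifying assumptions: the expansion point θ* is kept fixed (the periodic re-centering described above is left out), and grad_f and hess_diag_f are assumed user-supplied derivatives of the point-wise criterion f:

```python
import numpy as np

class SecondOrderOnline:
    """Accumulate exponentially weakened first and second derivatives of f at a fixed
    expansion point theta_star, as in the recurrence (24), and read off parameters
    with the per-coordinate Newton step (25)."""

    def __init__(self, theta_star, grad_f, hess_diag_f, eta=0.97):
        self.theta_star = np.asarray(theta_star, dtype=float)
        self.grad_f, self.hess_diag_f, self.eta = grad_f, hess_diag_f, eta
        self.d1 = np.zeros_like(self.theta_star)    # ~ partial_i F^T(theta_star)
        self.d2 = np.zeros_like(self.theta_star)    # ~ partial_ii F^T(theta_star)

    def step(self, x):
        # F^{T+1} = eta F^T + f(x^T, .) applied to both derivative accumulators.
        self.d1 = self.eta * self.d1 + self.grad_f(x, self.theta_star)
        self.d2 = self.eta * self.d2 + self.hess_diag_f(x, self.theta_star)
        safe = np.where(np.abs(self.d2) > 1e-9, self.d2, 1e-9)   # crude guard for tiny curvature
        return self.theta_star - self.d1 / safe                   # Newton step (25)
```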


2) Adaptive minimization with online parabola model: To get a more continuous update of parameters, an alternative approach might be treating $(\theta^t, \nabla_\theta f(x^t, \theta)|_{\theta=\theta^t})_{t \le T}$ as a sequence of (value, noisy gradient) pairs, as in stochastic gradient descent. Instead of directly using the second derivative, we can see the neighboring minimum as the place where the gradient becomes zero - which can be estimated by finding the linear trend of gradients and calculating where this trend crosses zero.

The linear trend of gradients can be calculated in an online way using least squares linear regression with exponentially weakening weights [14]. Let us present it in the one-dimensional case, e.g. to perform optimization separately for each $\theta_i$ parameter, which seems sufficient for a slowly evolving minimum. In the multidimensional case it can alternatively be done using Hessian inversion as in Newton's method - which is more costly, but might be slightly better.

So let us focus on optimization of a 1D parameter θ ∈ R, e.g. to be applied separately to each coordinate of a multidimensional θ. Denote its values at successive times as $\theta^t$, and $g^t = \partial_\theta f(x^t, \theta)|_{\theta = \theta^t}$ as the corresponding history of gradients. Analogously to (14), to the $(\theta^t, g^t)_{t \le T}$ sequence we would like to fit a parabola $f(\theta) = h + \frac{1}{2}\lambda(\theta - p)^2$, optimizing agreement of the derivatives $g^t \approx f'(\theta^t)$ for $f'(\theta) = \lambda(\theta - p)$, using exponentially weakening weights $\eta^{T-t}$:

$\arg\min_{\lambda, p} \sum_{t \le T} \eta^{T-t} \left(g^t - \lambda(\theta^t - p)\right)^2 \quad$ (26)

This least squares linear regression leads ([14]) to λ as the (θ, g) covariance divided by the θ variance, and to a gradient descent with learning rate $\lambda^{-1}$ applied to the averaged position and gradient:

$\lambda = \frac{\overline{g\theta} - \bar{g}\,\bar{\theta}}{\overline{\theta^2} - \bar{\theta}^2} \qquad p = \bar{\theta} - \lambda^{-1}\bar{g} \quad$ (27)

for the exponential moving averages $\bar{\theta}, \bar{g}, \overline{g\theta}, \overline{\theta^2}$ with η ∈ (0, 1):

$\bar{\theta}^{T+1} = \eta\, \bar{\theta}^T + (1-\eta)\, \theta^T \qquad \bar{g}^{T+1} = \eta\, \bar{g}^T + (1-\eta)\, g^T$

$\overline{g\theta}^{T+1} = \eta\, \overline{g\theta}^T + (1-\eta)\, g^T \theta^T \qquad \overline{\theta^2}^{T+1} = \eta\, \overline{\theta^2}^T + (1-\eta)\, (\theta^T)^2$

The found $p = \bar{\theta} - \lambda^{-1}\bar{g}$ is the modeled minimum if λ > 0. Seeing it as gradient descent (using the averaged gradient and position), we can e.g. use the absolute value and clipping with ε > 0:

$p = \bar{\theta} - \max(|\lambda^{-1}|, \epsilon)\, \bar{g}$

to handle also negative λ and λ ≈ 0 situations (e.g. near an inflection point). Finally we can e.g. use

$\theta^{T+1} = \alpha p + (1-\alpha)\, \theta^T$

as the parameter evolution step, for some α ∈ (0, 1] describing trust in the parabola model.
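A 1D sketch of this procedure; note that the clipping below caps the learning rate $|\lambda^{-1}|$ at $1/\epsilon$, which is one reading of the clipping described above, and that the zero initialization of the averages and all constants are illustrative assumptions:

```python
import numpy as np

class OnlineParabola1D:
    """Online parabola model (26)-(27): exponential moving averages of theta, g, g*theta,
    theta^2 give the gradient trend lambda and its zero crossing p, mixed into theta with trust alpha."""

    def __init__(self, theta0, eta=0.95, alpha=0.3, eps=0.1):
        self.theta, self.eta, self.alpha, self.eps = float(theta0), eta, alpha, eps
        self.m_t = self.m_g = self.m_gt = self.m_t2 = 0.0   # EMAs (biased toward 0 at the start)

    def step(self, g):
        t, e = self.theta, self.eta
        self.m_t = e * self.m_t + (1 - e) * t
        self.m_g = e * self.m_g + (1 - e) * g
        self.m_gt = e * self.m_gt + (1 - e) * g * t
        self.m_t2 = e * self.m_t2 + (1 - e) * t * t
        var = self.m_t2 - self.m_t ** 2
        lam = (self.m_gt - self.m_g * self.m_t) / var if var > 1e-12 else 0.0
        rate = min(1.0 / abs(lam), 1.0 / self.eps) if lam != 0.0 else 1.0 / self.eps
        p = self.m_t - rate * self.m_g                      # estimated gradient zero crossing
        self.theta = self.alpha * p + (1 - self.alpha) * self.theta
        return self.theta

# Usage: noisy gradients of f(theta) = (theta - 2)^2; theta should drift toward the minimum near 2.
rng = np.random.default_rng(1)
opt = OnlineParabola1D(theta0=-1.0)
for _ in range(300):
    opt.step(2 * (opt.theta - 2.0) + rng.normal(scale=0.1))
print(opt.theta)
```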

While online optimization of parameters can be determined by minimization of an evolving criterion (24), like log-likelihood with exponentially weakening weights, there is still a required parameter (e.g. η) of the exponential moving average - defining how conservative the model should be. It is usually fixed in data compression (adaptivity) or machine learning (e.g. SGD optimizers), as in a wide range its optimization can only give a relatively tiny improvement.

So in practice such a parameter like η can just be fixed, e.g. optimized over a larger set of data. We could also try to slowly adapt it to local conditions, if also controlling the $\partial \theta^T / \partial \eta$ dependence for the used parameters. It would allow calculating $\partial f(x^T, \theta^T)/\partial \eta$:

$\frac{\partial f(x^T, \theta^T)}{\partial \eta}(\eta^T) = \frac{\partial \theta^T}{\partial \eta}(\eta^T) \cdot \nabla_\theta f(x^T, \theta)|_{\theta = \theta^T}$

Treating these as noisy derivatives again, we can use gradient methods e.g. after some averaging, or find their linear trend with online linear regression as above.

Generally, one could also try to extrapolate e.g. the future behavior of parameters based on the recent history; however, this requires extreme caution.

REFERENCES

[1] M. J. Weinberger, G. Seroussi, and G. Sapiro, "The LOCO-I lossless image compression algorithm: Principles and standardization into JPEG-LS," IEEE Transactions on Image Processing, vol. 9, no. 8, pp. 1309-1324, 2000.

[2] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, no. 5, pp. 465-471, 1978.

[3] J. Sneyers and P. Wuille, "FLIF: Free lossless image format based on MANIAC compression," in 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016, pp. 66-70.

[4] T. C. Mills, Time Series Techniques for Economists. Cambridge University Press, 1991.

[5] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," arXiv preprint arXiv:1601.06759, 2016.

[6] J. Menick and N. Kalchbrenner, "Generating high fidelity images with subscale pixel networks and multidimensional upscaling," arXiv preprint arXiv:1812.01608, 2018.

[7] R. Koenker and K. F. Hallock, "Quantile regression," Journal of Economic Perspectives, vol. 15, no. 4, pp. 143-156, 2001.

[8] J. Duda, "Exploiting statistical dependencies of time series with hierarchical correlation reconstruction," arXiv preprint arXiv:1807.04119, 2018.

[9] J. Duda and A. Szulc, "Credibility evaluation of income data with hierarchical correlation reconstruction," arXiv preprint arXiv:1812.08040, 2018.

[10] P. R. Tadikamalla, "Random sampling from the exponential power distribution," Journal of the American Statistical Association, vol. 75, no. 371, pp. 683-686, 1980.

[11] N. Fout and K.-L. Ma, "An adaptive prediction-based approach to lossless compression of floating-point volume data," IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 12, pp. 2295-2304, 2012.

[12] A. Haar, "Zur Theorie der orthogonalen Funktionensysteme," Mathematische Annalen, vol. 69, no. 3, pp. 331-371, 1910.

[13] J. Duda, "Fractal wavelets," 2014. [Online]. Available: https://github.com/JarekDuda/FractalWavelets

[14] J. Duda, "SGD momentum optimizer with step estimation by online parabola model," arXiv preprint arXiv:1907.07063, 2019.

