
Transcript of "Optimal Approximate Sampling from Discrete Probability Distributions" (Feras Saad, POPL 2020)

Page 1

Optimal Approximate Sampling from Discrete Probability Distributions

Feras Saad, Cameron Freer, Martin Rinard, Vikash Mansinghka

Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology

POPL 2020, New Orleans, LA, USA

https://github.com/probcomp/optimal-approximate-sampling

Page 2

Talk Outline

● Introduction

● Issues with Floating-Point Samplers

● Overview of Approach

● Experimental Results

● Related Work

● Technical Appendix


Page 4

What is a random sampling algorithm?

Let w := ( w1, ..., wn ) be a list of positive integers which sum to S.

Let p := ( p1, ..., pn ) be a probability distribution where pi = wi / S (i = 1, ..., n).

A sampling algorithm (sampler) for p is a randomized algorithm A such that:

Pr[A returns integer j] = pj    (j = 1, ..., n)

[Figure: outcomes 1, 2, 3, …, n drawn with probabilities p1, p2, p3, …, pn]

Page 5

Sampling is a fundamental operation in many fields

A sampler for p is a randomized algorithm A such that:

Pr[A returns integer j ] = pj (j = 1, ..., n)

Sampling is central to many fields…

● Robotics: Probabilistic Robotics (Thrun et al., 2005)
● Artificial Intelligence: Artificial Intelligence: A Modern Approach (Russell & Norvig, 1994)
● Computational Statistics: Random Variate Generation (Devroye, 1986)
● Operations Research: Simulation Techniques in Operations Research (Harling, 1958)
● Statistical Physics: Monte Carlo Methods in Statistical Physics (Binder, 1986)
● Financial Engineering: Monte Carlo Methods in Financial Engineering (Glasserman, 2003)
● Machine Learning: An Introduction to MCMC for Machine Learning (Andrieu et al., 2003)
● Systems Biology: Randomization and Monte Carlo Methods in Biology (Manly, 1991)
● Scientific Computing: Monte Carlo Strategies in Scientific Computing (Liu, 2001)


Page 6

Libraries provide algorithms for discrete sampling
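As a concrete illustration (ours, not a reproduction of the slide), one such library interface in C++; the weight list (1, 2, 1, 3, 5) anticipates the example used on the Error 1 slides below:

#include <iostream>
#include <random>
#include <vector>

int main() {
    // std::discrete_distribution normalizes a user-supplied weight list
    // internally, using floating-point arithmetic.
    std::vector<double> weights = {1, 2, 1, 3, 5};   // relative probabilities
    std::mt19937_64 prng(42);                        // pseudorandom bit source
    std::discrete_distribution<int> dist(weights.begin(), weights.end());

    std::vector<long> counts(weights.size(), 0);
    for (int i = 0; i < 120000; ++i)
        counts[dist(prng)] += 1;                     // outcomes are 0-indexed
    for (std::size_t j = 0; j < counts.size(); ++j)
        std::cout << "outcome " << j + 1 << ": " << counts[j] << "\n";
}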

Page 7

Talk Outline

● Introduction

● Issues with Floating-Point Samplers

● Overview of Approach

● Experimental Results

● Related Work

● Technical Appendix


Page 9

Libraries assume analytic model of sampling

[Figure: Entropy Source S emits an "infinitely-precise" uniform random variate U (e.g., 0.1941231237…), which Sampling Algorithm A transforms into an outcome in {1, ..., n} with probabilities (p1, ..., pn); this is the "real RAM" model]

Page 11

Pros & cons of analytic model of random sampling

Statisticians prove theorems about infinitely-precise transforms of U (assuming real RAM).

Library developers implement algorithms “directly” (using floating-point approximations).

But real RAM abstracts away details of entropy resources, numerical precision, computability, and complexity, all important for both CS theory and practice.

“Anyone who considers arithmetic methods of producing random digits is, of course, in a state of sin.”

“If one considers arithmetic methods in detail, it is quickly found that the critical thing about them is the very obscure, very imperfectly understood behavior of round-off errors in mathematics.”

“One might as well admit that one can prove nothing, because the amount of theoretical information about the statistical properties of the round-off mechanism is nil.”

“I have a feeling, however, that it is somewhat silly to take a random number and put it elaborately through a power series.”

Von Neumann, J. Various Techniques Used in Connection with Random Digits. In Monte Carlo Method. National Bureau of Standards Applied Mathematics Series, vol 12. 1951.


Page 18

The inversion method: A universal sampler

Let p := (p1, ..., pn) be a discrete probability distribution (0 < pi < 1, ∑i pi = 1).

Step 1: Use p to make n bins of the unit interval [0,1].

Step 2: Simulate U ~ Uniform([0,1]).

Step 3: Return the integer j such that U is in bin j.

"Throw a dart and choose the bin it lands in."

[Figure: unit interval [0,1] split into bins of widths p1, p2, p3, ..., pn, with the dart U landing in one bin]
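A minimal sketch (ours, not the talk's code) of the inversion method as libraries implement it with floating point; both error sources called out on the next slides are visible in the comments:

#include <random>
#include <vector>

// Inversion method: partition [0,1] into bins of widths p[0..n-1] and
// return the (1-indexed) bin containing a uniform draw U.
int inversion_sample(const std::vector<double>& p, std::mt19937_64& prng) {
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    double u = unif(prng);                 // U is discrete, only finitely many values
    double cumulative = 0.0;
    for (std::size_t j = 0; j < p.size(); ++j) {
        cumulative += p[j];                // running sums accumulate rounding error
        if (u < cumulative) return static_cast<int>(j) + 1;
    }
    return static_cast<int>(p.size());     // guard: the float sum of p may fall below 1
}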


Page 22

Sources of error in floating-point samplers

Step 1: Use p to make n bins of the unit interval [0,1].

Step 2: Simulate U ~ Uniform([0,1]).

Step 3: Return the integer j such that U is in bin j.

Error 1: Computing the normalized probabilities and running sums.

Error 2: Computing the canonical "real"-valued uniform random variate U on the unit interval.

Page 23

Error 1: Computing target probabilities (C++ example)

The user specifies a list of non-negative numbers (i.e., relative probabilities).

The desired probabilities are [1/12, 2/12, 1/12, 3/12, 5/12].

The actual probabilities are not exact (IEEE 754 double-precision floats).
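A minimal sketch (ours) of Error 1: normalizing the weights (1, 2, 1, 3, 5) in double precision cannot represent 1/12, 1/6, or 5/12 exactly, since none of them is a dyadic rational (3/12 = 1/4 is the only dyadic entry):

#include <cstdio>

int main() {
    double w[] = {1, 2, 1, 3, 5};
    double S = 0;
    for (double x : w) S += x;          // S = 12 (exact, small integers)
    for (double x : w)
        std::printf("%.20f\n", x / S);  // 1/12 prints something like
                                        // 0.08333333333333332871..., not 1/12
}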

Page 26

Error 2: Computing the canonical variate U (C++ example)

Informally, the canonical "real" random variate U is generated by:

1. Draw k random bits, forming a k-bit integer Z in {0, …, 2^k - 1}.

2. Set U := Z / 2^k.

This is parametrized by the precision, i.e., the number of bits used to represent U in floating point.
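A minimal sketch (ours) of how the canonical variate is formed; conceptually this is what C++'s std::generate_canonical<double, 53> produces:

#include <cstdint>
#include <random>

double canonical_u(std::mt19937_64& prng) {
    const int k = 53;                       // significand bits of a double
    std::uint64_t z = prng() >> (64 - k);   // k-bit integer Z in {0, ..., 2^k - 1}
    return static_cast<double>(z) / 9007199254740992.0;   // U := Z / 2^53
}

Only 2^53 distinct values of U are possible, so the resulting sampler is discrete no matter how U is post-processed.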

Page 27

Floating-point samplers also waste random bits

The canonical uniform random variate U is constructed by:

1. Draw k random bits, forming a k-bit integer Z in {0, …, 2^k - 1}.

2. Set U := Z / 2^k.

In IEEE 754 double-precision floating point, k = 53 bits (the C++ default).

Using 53 bits per sample can be very wasteful (depending on p).

Example: Consider the distribution p = [½, ½]. Only 1 random bit is needed to sample from p, so naive inversion using U wastes 52 bits!

[Figure: unit interval split into p1 = ½ and p2 = ½, with the dart U]
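A minimal sketch (ours) of the contrast for p = [½, ½]: buffering the PRNG's output and consuming exactly one bit per sample, instead of 53:

#include <cstdint>
#include <random>

// Returns 1 or 2 with probability 1/2 each, consuming one stored bit per call.
int sample_fair_coin(std::mt19937_64& prng, std::uint64_t& buffer, int& bits_left) {
    if (bits_left == 0) { buffer = prng(); bits_left = 64; }  // refill lazily
    int b = static_cast<int>(buffer & 1);                     // use exactly 1 bit
    buffer >>= 1;
    --bits_left;
    return b + 1;
}

One 64-bit PRNG call now yields 64 samples, where naive inversion would spend 53 bits on each.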


Page 29

Recap: Inefficiencies & errors in common samplers

Floating-point sampling algorithms (MATLAB, numpy, R, C++, etc.):

● Waste random bits

a. Generating U using a full machine word (e.g., 53 bits) can use significantly more random bits than are theoretically needed to sample from p.

Wasting random bits => excessive calls to the underlying PRNG => slower wall-clock time.

● Have suboptimal sampling error

a. The sampler manipulates the probabilities (p1, …, pn) using limited-precision arithmetic.

b. The canonical "real" random variate U is discrete and only approximately uniform.

Modern samplers generate billions of samples per second => small errors can magnify.


Page 34

Talk Outline

● Introduction

● Issues with Floating-Point Samplers

● Overview of Approach

● Experimental Results

● Related Work

● Technical Appendix

Page 35

Goals for optimal approximate sampling

Design a limited-precision sampler for a discrete probability distribution that:

● is optimally efficient in its use of random bits

(uses the theoretically smallest number of random bits per sample, on average)

Idea: Rather than generate U all "at once", lazily generate random bits "one at a time" and use them optimally [Knuth & Yao, 1976].

● is optimally accurate

(attains the theoretically smallest sampling error, for the given precision)

Idea: Explicitly minimize the error over the set of achievable distributions of entropy-optimal limited-precision samplers.


Page 39

Key contributions

1. Precise formulation of optimal approximate sampling for discrete distributions.

2. Efficient algorithms for finding optimal approximations to discrete distributions.

3. Efficient algorithms for constructing entropy-optimal samplers.

4. Empirical comparisons to existing limited-precision samplers: superior wall-clock time and sampling accuracy.

5. Empirical comparisons to existing exact samplers: enables optimally trading off precision against accuracy.



Page 46

Random bit model of sampling

The random source S lazily emits fair 0-1 coin flips (replacing the uniform random variate U).

For a discrete distribution p := (p1, ..., pn), a sampler is a partial map from finite sequences of coin flips to outcomes:

A : ⊎k {0,1}^k → {1, ..., n}

For a continuous distribution F, a sampler is a partial map from finite sequences of coin flips to digits of the real output (in a number system, e.g., binary expansion, continued fraction):

A : ⊎k {0,1}^k → ⊎d {0,1}^d

D. Knuth, A. Yao. The complexity of nonuniform random number generation. In Algorithms and Complexity: New Directions and Recent Results. 1976.

Page 47

Analytic vs. random bit model of sampling

[Figure, analytic model: Entropy Source S emits an "infinitely-precise" uniform random variate U = 0.1941231237…, which Sampling Algorithm A maps to an outcome in {1, ..., n} with probabilities (p1, ..., pn)]

[Figure, random bit model: Entropy Source S lazily emits fair/independent bits b1 = 0, b2 = 1, b3 = 1, b4 = 0, …, which Sampling Algorithm A maps to an outcome in {1, ..., n} with probabilities (p1, ..., pn)]

Page 48

Every sampler in the random bit model is a tree

Let A be a sampling algorithm for a discrete distribution p.

We can represent A as a complete binary tree T (each internal node has two children).

1. Start at the root node.
2. Call flip; if 0 go to the left child, if 1 go to the right child.
3. If the child is a leaf, return the label of that node. Else go to 2.

T is called a discrete distribution-generating (DDG) tree.

Example: DDG tree for a fair die, p = (⅙, ⅙, ⅙, ⅙, ⅙, ⅙); it encodes a rejection sampler:

(001, 010, 011, 100, 101, 110) -> (1, 2, 3, 4, 5, 6)
(000, 111) -> reject and repeat

[Figure: depth-3 binary tree with 0/1 edge labels, six leaves labeled 1-6, and two reject leaves with back edges to the root]
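A minimal sketch (ours) of executing this DDG tree: descend three levels by flipping bits, accept leaves 001-110, and follow the back edges (reject) otherwise:

#include <random>

int flip(std::mt19937_64& prng) { return static_cast<int>(prng() & 1); }

// Rejection sampler for a fair die encoded by the depth-3 DDG tree above.
int roll_fair_die(std::mt19937_64& prng) {
    while (true) {
        int z = 0;
        for (int level = 0; level < 3; ++level)
            z = 2 * z + flip(prng);        // 0 = go left, 1 = go right
        if (1 <= z && z <= 6) return z;    // leaves 001..110 -> outcomes 1..6
        // leaves 000 and 111: reject and repeat (back edges to the root)
    }
}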


Page 53

Entropy-optimal sampling (Knuth-Yao 1976)

For a distribution p = (p1, …, pn), find the DDG tree with the least average number of flips (i.e., the "entropy-optimal" tree).


Page 55

Entropy-optimal sampling (Knuth-Yao 1976)

Theorem: The entropy-optimal tree has leaf i at level j iff the jth bit in the binary expansion of pi is 1.

Example: p = (1/2, 1/4, 1/4) (dyadic probabilities, no back edges).
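As a worked check of the theorem (ours, not on the slide): 1/2 = (0.100…)₂ and 1/4 = (0.010…)₂, so the tree has leaf 1 at level 1 and leaves 2 and 3 at level 2. The expected number of flips is 1·(1/2) + 2·(1/4) + 2·(1/4) = 1.5 = H(p), so in the dyadic case the sampler is exactly entropy-optimal with no back edges.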

Page 56

Example: p = (3/10, 7/10) (rational probabilities; the tree has back edges).

Page 57

Example: p = (1/π, 1/e, 1 - 1/π - 1/e) (irrational probabilities; uncomputable, not considered further).

Page 58

Bounding the number of bits in an entropy-optimal DDG

Theorem: The expected number of flips of an entropy-optimal sampler satisfies

H(p) ≤ E[# flips] < H(p) + 2,

where H(p) := ∑i pi log(1/pi) is the Shannon entropy.

Intuition: The lower bound is obvious.

Upper bound: The uniform distribution on {1, …, 2^k} is a full binary tree; it needs exactly k bits. Knuth-Yao prove there is a worst-case 2-bit cost for sampling with non-full binary trees.
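As a concrete instance (ours): for p = (3/10, 7/10), H(p) = -(0.3 log2 0.3 + 0.7 log2 0.7) ≈ 0.88 bits, so an entropy-optimal sampler uses fewer than 2.88 expected flips per sample, versus the 53 bits a double-precision inversion sampler consumes on every draw.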

Page 59

Exact entropy-optimal samplers = exponential size

Why not just use the exact entropy-optimal samplers from Knuth-Yao?

Example: The entropy-optimal tree for Binomial(n=50, p=0.61) has 10^104 levels (i.e., ~10^91 terabytes).

[Figure: log-linear plot of tree size, showing exponential growth]

Page 60

Exact entropy-optimal samplers = exponential size

Why not just use the exact entropy-optimal samplers from Knuth-Yao?

Theorem (3.5 and 3.6 in the main paper): Suppose (w1, …, wn) are positive integers summing to S; set pi = wi / S (i = 1, …, n). The entropy-optimal sampler for p has depth at most S - 1, and this bound is tight.

DDG trees are exponentially large in the number of bits needed to encode p!

- Θ(n log(S)) bits are needed to encode the input p.
- Θ(n S) bits are needed to encode DDG(p) in the worst case.

Page 61

Exact samplers from Knuth & Yao are often infeasible to construct in practice:

"Most of the algorithms which achieve these optimum bounds are very complex, requiring a tremendous amount of space." [Knuth & Yao 1976, p. 409]


Page 63

Approximate entropy-optimal sampling

We establish a principled and efficient algorithm for replacing

an entropy-optimal exact DDG tree (sampler) for p, with arbitrarily large depth,

with

an entropy-optimal closest approximate sampler for p that has any pre-specified depth k.


Page 65

Approximate entropy-optimal sampling

Problem: For a distribution p = (p1, …, pn), find an entropy-optimal depth-k DDG tree whose output distribution q = (q1, …, qn) minimizes the error Δ(p, q).

Error is inevitable, since the depth (precision) is limited to k bits.

In the random bit model, the depth of the DDG tree ⟺ the bit-precision of the sampler.

We allow the "depth-k" tree to have back edges.


Page 70

Summary of main result (Theorem 4.7 + Prop. 2.16)

Given any

- discrete probability distribution p,
- specification of k > 0 bits of precision, and
- measure of statistical error that is an f-divergence,

we efficiently construct the most accurate sampling algorithm among all entropy-optimal samplers for p that use k bits of precision.

Additional property: Our samplers are more accurate and entropy-efficient than any sampler that consumes at most k random bits per sample.

- This includes all floating-point samplers in standard libraries that transform a uniform variate U (e.g., R, MATLAB, numpy, scipy, C++, GNU GSL, etc.).


Page 73

f-Divergences: A family of statistical error metrics

Let S be the space of all probability distributions on some finite set.

Definition: A statistical divergence Δ on S is a function Δ(·||·) : S × S → [0, ∞] such that Δ(p||q) = 0 if and only if p = q.

Definition: An f-divergence is a statistical divergence such that

Δg(p||q) = ∑i g(qi / pi) pi

for some convex function g : (0, ∞) → ℝ with g(1) = 0 (g is called the generator of Δg).

f-divergences are widely used in information theory, statistics, machine learning, etc.


Page 75

Examples of f-divergences & generating functions
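The slide's table is a figure; under the convention Δg(p||q) = ∑i g(qi/pi) pi above, standard instances (our reconstruction, not necessarily the slide's exact list) include:

● Total variation ∑i |pi - qi|: generator g(t) = |t - 1|
● KL divergence ∑i pi log(pi/qi): generator g(t) = -log(t)
● Hellinger divergence ∑i (√pi - √qi)^2: generator g(t) = (√t - 1)^2
● Chi-square divergence ∑i (qi - pi)^2 / pi: generator g(t) = (t - 1)^2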

Page 76

Minimizing error is a balls-to-bins problem

Problem: For a distribution p = (p1, …, pn), find an entropy-optimal depth-k DDG tree whose output distribution q = (q1, …, qn) minimizes the error Δg(p, q).

Theorem (3.4 in the main paper): The output probabilities (q1, ..., qn) of an entropy-optimal depth-k DDG tree are all integer multiples of 1 / (2^k - 2^ℓ · I[ℓ < k]), for the same ℓ (0 ≤ ℓ ≤ k).

Optimization Problem (formal): Assign Z balls to n bins so as to minimize the error

Δg(p||q) = ∑i g(mi / (Z pi)) pi,    (OBJ-FUN)

where mi is the number of balls assigned to bin i (i = 1, …, n), i.e., qi = mi / Z.

(Solve OBJ-FUN separately for Z = 2^k; 2^k - 2^(k-1); …; 2^k - 2; 2^k - 1.)
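A sketch (ours; the paper's algorithm is O(n log n), this direct version is O(n + Z log n)) making OBJ-FUN concrete: because the objective is separable and convex in the mi, assigning each of the Z balls greedily to the bin with the largest marginal error decrease yields an optimal allocation. Here g(t) = |t - 1| (total variation):

#include <cmath>
#include <cstdio>
#include <queue>
#include <utility>
#include <vector>

int main() {
    std::vector<double> p = {1.0/12, 2.0/12, 1.0/12, 3.0/12, 5.0/12};
    const long long Z = 16;                       // e.g. k = 4 bits, ell = k
    auto term = [&](int i, long long m) {         // p_i * g(m / (Z * p_i))
        return p[i] * std::fabs(m / (Z * p[i]) - 1.0);
    };
    // Max-heap of (error decrease from adding one more ball to bin i, i).
    std::priority_queue<std::pair<double, int>> heap;
    std::vector<long long> m(p.size(), 0);
    for (int i = 0; i < static_cast<int>(p.size()); ++i)
        heap.push({term(i, 0) - term(i, 1), i});
    for (long long b = 0; b < Z; ++b) {           // place Z balls one at a time
        auto [gain, i] = heap.top();
        (void)gain;                               // gain only orders the heap
        heap.pop();
        ++m[i];
        heap.push({term(i, m[i]) - term(i, m[i] + 1), i});
    }
    for (std::size_t i = 0; i < p.size(); ++i)
        std::printf("q_%zu = %lld/%lld\n", i + 1, m[i], Z);
}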


Page 79

Minimizing error is a balls-to-bins problem

O(n log n) discrete optimization algorithm (see the main paper).

Page 80

Talk Outline

● Introduction

● Issues with Floating-Point Samplers

● Overview of Approach

● Experimental Results

● Related Work

● Technical Appendix

Page 81

Optimal samplers: Wall-clock runtime benefits

- Knuth–Yao Sampler from Roy et al. [2013]: scales O(n H(p)) <= O(n log(n))
- Inversion Sampler (GNU C++ Library): scales O(n)
- Optimal Approximate Sampler (OAS; this work): scales O(H(p)) <= O(log(n))

[Figure: runtime comparison showing 10^2x–10^3x and 10^3x–10^4x speedups for OAS over the baselines]

Page 82

Optimal samplers: Efficiency benefits

● Optimal approximate samplers: call the PRNG a variable number of times => efficiently using random bits (faster).

● Inversion sampler (and other floating-point samplers): call the PRNG a fixed number of times => wasting random bits (slower).


Page 86:

Optimal samplers: Accuracy benefits

Each panel: a common family of discrete distributions.

x-axis: relative (Hellinger) error of a baseline sampler to our optimal sampler.

y-axis: fraction of 500 random distributions whose relative error is ≤ the value on the x-axis.

100x to 10000x more accurate.

Baselines: Interval Sampler [Uyematsu & Li 2003]; Inversion Sampler [GNU C++ library]. OAS (this work).

Page 87:

Total-Variation(p, q) = ∑i | pi – qi |

“Manhattan Distance”

Low entropy: one outcome with most mass.

Medium entropy: few outcomes with most mass.

High entropy: many outcomes with equal mass.

TV sampling error increases as entropy increases

How many bits do I need? It depends…

[Plot: Error Measure: Total Variation. x-axis: Entropy of Target Distribution (Bits), ranging over low, medium, and high entropy; y-axis: Optimal Error.]

Page 88:

KL(p, q) = ∑i pi log(pi / qi)

“Information-Theoretic Distance”

Low entropy: one outcome with most mass.

Medium entropy: few outcomes with most mass.

High entropy: many outcomes with equal mass.

KL sampling error decreases as entropy increases

How many bits do I need? It depends…

[Plot: Error Measure: KL Divergence. x-axis: Entropy of Target Distribution (Bits), ranging over low, medium, and high entropy; y-axis: Optimal Error.]

Page 89:

Entropy of target distribution dictates required bit-precision for desired level of error.

Error vs. entropy behaves differently depending on the choice of error metric.

Optimal approximate sampler algorithms work for many choices of error metrics.

How many bits do I need? It depends…

[Plots, side by side: Error Measure: Total Variation and Error Measure: KL Divergence. x-axes: Entropy of Target Distribution (Bits); y-axes: Optimal Error.]
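A minimal sketch (ours, not from the slides) of the two error measures as defined above, in Python; we assume qi > 0 wherever pi > 0 so that the KL term is finite:

    import math

    def tv_error(p, q):
        # Total variation as defined on the slide: sum_i |p_i - q_i|.
        return sum(abs(pi - qi) for pi, qi in zip(p, q))

    def kl_error(p, q):
        # KL divergence: sum_i p_i * log(p_i / q_i); terms with p_i = 0 contribute 0.
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    # Example: a 3-outcome target and a 7-bit dyadic approximation.
    p = [0.2, 0.3, 0.5]
    q = [26/128, 38/128, 64/128]
    print(tv_error(p, q), kl_error(p, q))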

Page 90:

Optimal samplers versus exact samplers

Page 91:

- roughly the same bits/sample as the exact entropy-optimal sampler
- use significantly less precision than the exact entropy-optimal sampler
- small approximation error

Optimal samplers versus exact samplers


Page 94:

Talk Outline

● Introduction

● Issues with Floating-Point Samplers

● Overview of Approach

● Experimental Results

● Related Work

● Technical Appendix

Page 95:

Application of DDG sampling: Cryptography

In cryptography, sampling over finite discrete “lattices” is a key subproblem:

● Entropy-optimal sampling needed

(entropy is expensive resource)

● Limited-precision analysis needed

(for theoretical security guarantees)

L. Ducas and P. Q. Nguyen. Faster Gaussian Lattice Sampling Using Lazy Floating-Point Arithmetic. ASIACRYPT 2012.
Roy et al. High Precision Discrete Gaussian Sampling on FPGAs. SAC 2013.
N. Dwarakanath and S. Galbraith. Sampling from Discrete Gaussians for Lattice-Based Cryptography on a Constrained Device. Appl. Alg. Eng. Comm. Comp., 25(3). 2014.
J. Follath. Gaussian Sampling in Lattice Based Cryptography. Tatra Mt. Math. Publ. 60. 2014.
C. Du and G. Bai. Towards Efficient Discrete Gaussian Sampling for Lattice-Based Cryptography. FPL 2015.

Page 96:

Hardware architecture for sampling DDG trees

Roy, et al. High Precision Discrete Gaussian Sampling on FPGAs. SAC 2013.

The (N × k) binary probability matrix P is encoded into ROM and sampled as follows:

Algorithm: Knuth-Yao Sampling
Input: probability matrix P
Output: sample in [0, ..., N-1]

    d = 0
    col = 0
    while True:
        r = flip()
        d = 2*d + (1 - r)
        for row in [N-1, ..., 0]:
            d -= P[row][col]
            if d == -1:
                return row
        col = col + 1
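A runnable Python transcription of this pseudocode (a sketch: flip() is realized with one fair bit from random.getrandbits, and the example matrix P below is ours, not one from the talk):

    import random

    def knuth_yao_sample(P):
        # P is an N x k binary matrix: P[row][col] is the (col+1)-th bit of
        # the binary expansion of the probability of outcome `row`.
        N, k = len(P), len(P[0])
        d = 0        # position among the unresolved nodes at the current level
        col = 0
        while col < k:
            r = random.getrandbits(1)    # flip(): one fair random bit
            d = 2*d + (1 - r)            # descend one level in the DDG tree
            for row in range(N - 1, -1, -1):
                d -= P[row][col]         # account for leaves labeled `row` at this level
                if d == -1:
                    return row
            col += 1
        raise RuntimeError("columns of P exhausted; probabilities must sum to 1")

    # Example: p = (1/2, 1/4, 1/4) encoded with k = 2 columns.
    P = [[1, 0],
         [0, 1],
         [0, 1]]
    print(knuth_yao_sample(P))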

Page 97:

Related work

1. Exact sampling with near-optimal entropy and linear space:
The Fast Loaded Dice Roller [Saad et al., 2020] (AISTATS, coming soon)

2. Coalgebraic framework for implementing and composing entropy-preserving reductions from arbitrary input sources to output distributions:
Kozen & Soloviev [2018]; see also Pae & Loui [2006] for asymptotically-optimal variable-length conversions using coins of unknown bias

3. Limited-precision samplers for discrete distributions:
random graph [Blanca & Mihail 2012], geometric [Bringmann & Friedrich 2013], uniform [Lumbroso 2013], discrete Gaussian [Folláth 2014], general [Uyematsu & Li 2003]

4. Variants of the random bit model (biased/unknown/non-i.i.d. sources, variable precision):
[von Neumann 1951; Elias 1972; Stout & Warren 1984; Blum 1986; Roche 1991; Peres 1992; Han & Verdú 1993; Vembu & Verdú 1995; Abrahams 1996; Cicalese et al. 2006; Kozen 2014]

Page 98:

Talk Outline

● Introduction

● Issues with Floating-Point Samplers

● Overview of Approach

● Experimental Results

● Related Work

● Technical Appendix

Page 99:

Bernoulli sampling in the random bit model

Suppose p ∈ (0, 1) and write p = (0.p1p2p3…)₂ in its base-2 expansion.

analytic sampler:

    Generate uniform real U
    return 1 if U < p else 0

random bit sampler:

    i = 0
    repeat
        i = i + 1
        Generate random bit B
    until B ≠ pi
    return 1 if B < pi else 0
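A sketch of the random-bit sampler in Python (ours; the bits pi of p are produced lazily by a hypothetical helper bit_of, so only finitely many bits are consumed in expectation):

    import random

    def bit_of(p, i):
        # i-th bit (1-indexed) of the binary expansion 0.p1 p2 p3 ... of p.
        return int(p * 2**i) % 2

    def bernoulli(p):
        # Lazily compare a fair bit stream B1 B2 ... against the bits of p;
        # the first index where they differ decides the outcome.
        i = 0
        while True:
            i += 1
            B = random.getrandbits(1)
            pi = bit_of(p, i)
            if B != pi:
                return 1 if B < pi else 0

    # The expected number of flips is 2, independent of p.
    print(sum(bernoulli(0.3) for _ in range(10000)) / 10000)  # ~ 0.3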

Page 100:

Rejection sampling in the random bit model

Suppose ( w1, …, wn ) are positive integers summing to S and put

pi = wi / S ( i = 1, …, n ).

Fix an integer k so that 2^(k−1) < S ≤ 2^k.

[Table with 2^k cells, indexed 0, …, 2^k − 1: the first w1 cells are labeled 1, the next w2 cells are labeled 2, …, the next wn cells are labeled n (S labeled cells in total); the remaining 2^k − S cells are labeled REJECT.]

random bit sampler: "choose a random cell in this table"

    repeat
        generate k random bits forming integer Z
    until Z < S
    return Table[Z]

* What is the expected # of flips?
* Is the method efficient? (in space? in entropy?)
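A sketch of this rejection sampler in Python (ours), drawing k fair bits per attempt:

    import random

    def rejection_sample(w):
        # w: positive integer weights summing to S; returns i with prob w[i]/S.
        S = sum(w)
        k = max(1, (S - 1).bit_length())   # smallest k with 2^(k-1) < S <= 2^k
        # table[z] = outcome of cell z for z in [0, S); cells in [S, 2^k) reject.
        table = [i for i, wi in enumerate(w) for _ in range(wi)]
        while True:
            Z = random.getrandbits(k)      # k random bits forming an integer
            if Z < S:                      # accept; otherwise flip k fresh bits
                return table[Z]

    # Example: weights (1, 2, 3) => probabilities (1/6, 2/6, 3/6).
    counts = [0, 0, 0]
    for _ in range(6000):
        counts[rejection_sample([1, 2, 3])] += 1
    print(counts)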

Page 101:

Problem For distribution p = ( p1, …, pn ), find an entropy-optimal depth-k DDG tree whose output distribution q = ( q1, …, qn ) minimizes the error Δ(p, q).

Suppose k = 7 bits. Truncate P after column 7 to get Q.

Issue: the probabilities of q sum to < 1. Solution? Normalize Q by adding 1s randomly.

Example Suppose Δ is KL divergence: KL(p ‖ q) = ∑i pi log(pi / qi).
Suppose p = [ϵ, (1−ϵ)/2, (1−ϵ)/2] with ϵ ≪ 1/2^k.
Then the first k digits of ϵ are all 0, so at least one unit must be added to Q.

Case 1: Add units to q1 ⟹ KL(p ‖ q) < ∞
Case 2: Do not add units to q1 ⟹ KL(p ‖ q) = ∞

Naive truncation can result in large sampling errors

P (rows = binary expansions of p1, p2, p3):

p1 = 0.000000010111110
p2 = 0.010111100010110
p3 = 0.010100000101010

Q = the first k = 7 columns of P.

Page 102:

Problem For distribution p = ( p1, …, pn ), find an entropy-optimal depth-k DDG tree whose output distribution q = ( q1, …, qn ) minimizes the error Δ(p, q).

Suppose k = 7 bits. Truncate P after column 7 to get Q.

Issue: the probabilities of q sum to < 1. Solution? Normalize Q by adding 1s randomly.

Example Suppose Δ is Pearson chi-square: χ²(p ‖ q) = ∑i (pi − qi)² / pi.

Suppose p = [ϵ, ϵ, ϵ, ϵ, 1−4ϵ] with ϵ ≪ 1/2^k. Then the first k digits of 1−4ϵ are all ones, so one unit (2^−k) remains to be added to Q.

Case 1: Give the last unit to q5 ⟹ χ²(p ‖ q) = 4ϵ + (1−4ϵ−1)² / (1−4ϵ) ≈ 4ϵ ≈ 0.
Case 2: Give the last unit to q1 ⟹ χ²(p ‖ q) ≥ (ϵ − 1/2^k)² / ϵ ≈ 1/(2^(2k) ϵ) ≫ 0.

Naive truncation can result in large sampling errors

P (rows = binary expansions of p1, p2, p3):

p1 = 0.000000010111110
p2 = 0.010111100010110
p3 = 0.010100000101010

Q = the first k = 7 columns of P.
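A small numeric check (ours) of the chi-square example above, with hypothetical values k = 7 and ϵ = 2^−20, showing that where the leftover unit goes changes the error dramatically:

    def chi_square(p, q):
        # Pearson chi-square as defined on the slide: sum_i (p_i - q_i)^2 / p_i.
        return sum((pi - qi)**2 / pi for pi, qi in zip(p, q))

    k = 7
    eps = 2.0**-20                        # eps << 1/2^k
    p = [eps, eps, eps, eps, 1 - 4*eps]
    trunc = [int(pi * 2**k) for pi in p]  # k-bit truncation: [0, 0, 0, 0, 127]
    to_q5 = [m / 2**k for m in trunc[:-1]] + [(trunc[-1] + 1) / 2**k]  # unit -> q5
    to_q1 = [(trunc[0] + 1) / 2**k] + [m / 2**k for m in trunc[1:]]    # unit -> q1
    print(chi_square(p, to_q5))   # ~ 4*eps: tiny
    print(chi_square(p, to_q1))   # ~ 2^(-2k)/eps: large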

Page 103:

Naive truncation can result in large sampling errors

Summary of the previous examples: the error of the truncated distribution is sensitive to

- the target distribution p;

- the precision specification k;

- the definition of the error measure Δ.

We will develop a general strategy that guarantees the least possible error for any setting of these parameters.

Page 104:

Minimizing error is a balls-to-bins problem

Problem For distribution p = ( p1, …, pn ), find an entropy-optimal depth-k DDG tree whose output distribution q = ( q1, …, qn ) minimizes the error Δg(p, q).

Theorem (3.4 in main paper) The output probabilities ( q1, ..., qn ) of an entropy-optimal depth-k DDG tree are all integer multiples of 1 / (2^k − 2^ℓ · I[ℓ < k]), for the same ℓ (0 ≤ ℓ ≤ k).

Optimization Problem (formal) Assign Z balls to n bins so as to minimize the error:

Δg(p ‖ q) = ∑i g(mi / (Z pi)) pi ,    (OBJ-FUN)

where mi is the number of balls assigned to bin i (i = 1, …, n), i.e., qi = mi / Z.

(solve OBJ-FUN separately for Z = 2^k; 2^k − 2^(k−1); …; 2^k − 2; 2^k − 1)

Page 105:

Optimization Problem (formal) Assign Z balls to n bins so as to minimize the error:

Δg(p ‖ q) = ∑i g(mi / (Z pi)) pi ,    (OBJ-FUN)

where mi is the number of balls assigned to bin i (i = 1, …, n), i.e., qi = mi / Z.

(solve OBJ-FUN separately for Z = 2^k; 2^k − 2^(k−1); …; 2^k − 2; 2^k − 1)

In the next slides, we will solve this problem (for any value of Z).

Notation Denote the set of assignments of Z indistinguishable balls to n bins by

𝓜[n, Z] = { (m1, …, mn) | 0 ≤ mi ≤ Z and ∑i mi = Z }.

Minimizing error is a balls-to-bins problem
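A direct Python transcription (ours) of OBJ-FUN as an evaluation function, with the generator g passed in; e.g. g(x) = −log x recovers KL divergence:

    import math

    def obj_fun(m, p, Z, g):
        # Delta_g(p || q) = sum_i p_i * g(m_i / (Z * p_i)), where q_i = m_i / Z.
        return sum(pi * g(mi / (Z * pi)) for mi, pi in zip(m, p))

    # Example: evaluate a candidate assignment of Z = 2^7 balls under g(x) = -log(x).
    p = [0.2, 0.3, 0.5]
    Z = 2**7
    m = [26, 38, 64]                      # one element of M[n, Z]
    print(obj_fun(m, p, Z, lambda x: -math.log(x)))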

Page 106:

Theorem (4.11 in main paper) Let m_Z = (m1, ..., mn) be an optimum of OBJ-FUN over 𝓜[n, Z] for some Z > 0. Then m_(Z+1) = (m1, ..., mj + 1, …, mn) is an optimum of OBJ-FUN over 𝓜[n, Z+1], where

j = argmin_{i=1,…,n} pi ( g((mi + 1) / (Zpi)) − g(mi / (Zpi)) ).

In English: "Find the index j where incrementing mj gives the least increase in the error."

Proof (Sketch) The error delta pi ( g((mi + 1) / (Zpi)) − g(mi / (Zpi)) ) is the slope of the secant of g connecting the points mi / (Zpi) and (mi + 1) / (Zpi).

Leverage the fact that for any convex g the secant slopes increase from left to right, together with the optimality of m_Z, to show that any other solution m′_(Z+1) can always be made more optimal by moving one of its coordinates one unit (up or down) toward m_(Z+1).

[Figure: the secant slopes of a convex g increase from left to right.]

Idea: Greedy optimization (Version 1)

Page 107:

Theorem (4.11 in main paper) Let m_Z = (m1, ..., mn) be an optimum of OBJ-FUN over 𝓜[n, Z] for some Z > 0. Then m_(Z+1) = (m1, ..., mj + 1, …, mn) is an optimum of OBJ-FUN over 𝓜[n, Z+1], where

j = argmin_{i=1,…,n} pi ( g((mi + 1) / (Zpi)) − g(mi / (Zpi)) )

If we can find any optimum, we can apply the Theorem repeatedly to grow it until it sums to Z.
Observation: 𝓜[n, 0] has a single element, (0, ..., 0), which is therefore optimal.

Algorithm runtime = O(n + 2^k).
1. Initialize m = (0, …, 0).
2. For each t = 1, …, 2^k:

a. Find j as defined above.
b. Increment mj = mj + 1.

A first-pass algorithm (exponential time)
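A sketch (ours) of this first-pass algorithm; the argmin is recomputed naively at each step, so this version runs in O(n · 2^k) rather than the amortized bound on the slide:

    import math

    def greedy_assignment(p, Z, g):
        # Grow the optimal assignment from M[n, 0] = {(0, ..., 0)} one ball at
        # a time (Theorem 4.11): always increment the bin whose error increase
        # p_i * (g((m_i + 1)/(Z p_i)) - g(m_i/(Z p_i))) is smallest.
        n = len(p)
        m = [0] * n
        for _ in range(Z):
            j = min(range(n),
                    key=lambda i: p[i] * (g((m[i] + 1) / (Z * p[i]))
                                          - g(m[i] / (Z * p[i]))))
            m[j] += 1
        return m   # element of M[n, Z] chosen to minimize OBJ-FUN

    # Example with g(x) = -log(x) (KL divergence); g(0) = +inf, so every bin
    # with p_i > 0 is forced to receive at least one ball.
    g = lambda x: -math.log(x) if x > 0 else math.inf
    print(greedy_assignment([0.2, 0.3, 0.5], 2**7, g))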

Page 108:

Theorem (4.10 in main paper) Suppose m ∈ 𝓜[n, Z] for some Z > 0, not necessarily optimal; m can be made into an optimum of OBJ-FUN over 𝓜[n, Z] by greedy local moves:

    repeat forever:
        let j = argmin_{i=1,...,n} pi ( g((mi + 1) / (Zpi)) − g(mi / (Zpi)) )   // least cost of an increment: ϵj
        let l = argmin_{i=1,...,n} pi ( g((mi − 1) / (Zpi)) − g(mi / (Zpi)) )   // least cost of a decrement: ϵl
        if ϵj + ϵl < 0:
            set mj = mj + 1
            set ml = ml − 1
        else:
            return (m1, …, mn)

Use the Theorem to optimize an arbitrary initial assignment:

Algorithm worst-case runtime = O(n + 2^k).
1. Initialize m = (2^k, 0, …, 0).
2. Run the loop in the Theorem.

Idea: Greedy optimization (Version 2)

Page 109:

Combining greedy optimization algorithms

Theorem 4.11: "an optimal solution that sums to Z can quickly be made optimal for Z ± 1."
Theorem 4.10: "a non-optimal solution that sums to Z can be greedily made optimal for Z."

If we can find an initial tuple m* that has the following properties:
Property 1: m* is at most n units away from being optimal for 𝓜[n, ∑m*];
Property 2: | Z − ∑m* | ≤ n,

then we have the following linear-time algorithm for optimizing OBJ-FUN:

1. Initialize the solution to m*.
2. Use Thm 4.10 to make m* optimal for 𝓜[n, ∑m*].   // O(n) by Property 1
3. Use Thm 4.11 to make m* optimal for 𝓜[n, Z].     // O(n) by Property 2
4. Return m*.

Theorem (4.14 in main paper) The following initialization satisfies Properties 1 and 2:
mi = ⌊Zpi⌋ + I[ g((⌊Zpi⌋ + 1) / (Zpi)) < g(⌊Zpi⌋ / (Zpi)) ]   ( i = 1, …, n ).
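Putting the pieces together, a sketch (ours) of this pipeline; the function name `optimize` and the helper `cost` are our naming, and the argmins are recomputed naively for brevity, so this sketch is O(n²) per phase rather than the O(n) of the slide (or the O(n log n) of the paper's full algorithm):

    import math

    def optimize(p, Z, g):
        n = len(p)
        cost = lambda i, mi: p[i] * g(mi / (Z * p[i]))   # one term of OBJ-FUN

        # Theorem 4.14: initialize each bin to floor(Z p_i), rounding up when
        # the ceiling has the smaller per-bin error.
        m = []
        for i in range(n):
            f = math.floor(Z * p[i])
            m.append(f + 1 if cost(i, f + 1) < cost(i, f) else f)

        # Theorem 4.10: local swaps until no increment/decrement pair helps.
        while True:
            j = min(range(n), key=lambda i: cost(i, m[i] + 1) - cost(i, m[i]))
            l = min((i for i in range(n) if m[i] > 0),
                    key=lambda i: cost(i, m[i] - 1) - cost(i, m[i]))
            eps_j = cost(j, m[j] + 1) - cost(j, m[j])
            eps_l = cost(l, m[l] - 1) - cost(l, m[l])
            if eps_j + eps_l < 0 and j != l:
                m[j] += 1
                m[l] -= 1
            else:
                break

        # Theorem 4.11: add or remove balls one at a time until sum(m) == Z.
        while sum(m) < Z:
            j = min(range(n), key=lambda i: cost(i, m[i] + 1) - cost(i, m[i]))
            m[j] += 1
        while sum(m) > Z:
            l = min((i for i in range(n) if m[i] > 0),
                    key=lambda i: cost(i, m[i] - 1) - cost(i, m[i]))
            m[l] -= 1
        return m

    g = lambda x: -math.log(x) if x > 0 else math.inf
    print(optimize([0.2, 0.3, 0.5], 2**7, g))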

Page 110:

Intuition for the choice of initial assignment

Theorem (4.14 in main paper) The following initialization satisfies Properties 1 and 2:
mi = ⌊Zpi⌋ + I[ g((⌊Zpi⌋ + 1) / (Zpi)) < g(⌊Zpi⌋ / (Zpi)) ]   ( i = 1, …, n ).

[Figure: three cases for the generator — Case 1: g > 0 everywhere; Case 2: g < 0 on some region of (0, 1); Case 3: g < 0 on some region of (1, ∞) — showing which of the neighboring ratios ⌊Zpi⌋ / (Zpi) and (⌊Zpi⌋ + 1) / (Zpi) the initialization selects in each case.]

Page 111:

Approximate entropy-optimal sampling

We establish a principled and efficient algorithm for replacing

    an entropy-optimal exact DDG tree (sampler) for p, which may be very deep (arbitrarily large depth),

with

    an entropy-optimal closest approximate sampler for p that has any pre-specified depth k.