One Algorithm to Rule Them All: How to Automate Statistical Computation


Transcript of One Algorithm to Rule Them All: How to Automate Statistical Computation

Page 1: One Algorithm to Rule Them All: How to Automate Statistical Computation

Can one algorithm rule them all?

How to automate statistical computations

Alp Kucukelbir

COLUMBIA UNIVERSITY

Page 2: One Algorithm to Rule Them All: How to Automate Statistical Computation

Can one algorithm rule them all?

Not yet. (But some tools can help!)

Page 3: One Algorithm to Rule Them All: How to Automate Statistical Computation

Rajesh Ranganath Dustin Tran

Andrew Gelman David Blei

Page 4: One Algorithm to Rule Them All: How to Automate Statistical Computation

Machine Learning

[Diagram: data → machine learning → hidden patterns]

We want to discover and explore hidden patterns: to study hard-to-see connections, to predict future outcomes, to explore causal relationships.

Page 5: One Algorithm to Rule Them All: How to Automate Statistical Computation

How taxis navigate the city of Porto [1.7m trips] (K et al., 2016).

Page 6: One Algorithm to Rule Them All: How to Automate Statistical Computation

How do we use machine learning?

Page 7: One Algorithm to Rule Them All: How to Automate Statistical Computation

[Diagram: data + statistical model → machine learning expert → hidden patterns (many months later)]

Page 8: One Algorithm to Rule Them All: How to Automate Statistical Computation

[Diagram: data + statistical model → machine learning expert → hidden patterns (many months later)]

Page 9: One Algorithm to Rule Them All: How to Automate Statistical Computation

[Diagram: data + statistical model → machine learning expert → hidden patterns (many months later)]

Statistical Model
- Make assumptions about data.
- Capture uncertainties using probability.

Page 10: One Algorithm to Rule Them All: How to Automate Statistical Computation

[Diagram: data + statistical model → machine learning expert → hidden patterns (many months later)]

Statistical Model
- Make assumptions about data.
- Capture uncertainties using probability.

Page 11: One Algorithm to Rule Them All: How to Automate Statistical Computation

[Diagram: data + statistical model → machine learning expert → hidden patterns (many months later)]

Statistical Model
- Make assumptions about data.
- Capture uncertainties using probability.

Machine Learning Expert
- aka a PhD student.

Page 12: One Algorithm to Rule Them All: How to Automate Statistical Computation

[Diagram: data + statistical model → machine learning expert → hidden patterns (many months later)]

Statistical Model
- Make assumptions about data.
- Capture uncertainties using probability.

Machine Learning Expert
- aka a PhD student.

Page 13: One Algorithm to Rule Them All: How to Automate Statistical Computation

[Diagram: data + statistical model → machine learning expert → hidden patterns (many months later)]

Machine learning should be
1. Easy to use
2. Scalable
3. Flexible.

Page 14: One Algorithm to Rule Them All: How to Automate Statistical Computation

[Diagram: data + statistical model → automatic tool → hidden patterns (instant), with a revise loop back to the model]

Machine learning should be
1. Easy to use
2. Scalable
3. Flexible.

“[Statistical] models are developed iteratively: we build a model, use it to analyze data, assess how it succeeds and fails, revise it, and repeat.” (Box, 1960; Blei, 2014)

Page 15: One Algorithm to Rule Them All: How to Automate Statistical Computation

[Diagram: data + statistical model → automatic tool → hidden patterns (instant), with a revise loop back to the model]

Machine learning should be
1. Easy to use
2. Scalable
3. Flexible.

“[Statistical] models are developed iteratively: we build a model, use it to analyze data, assess how it succeeds and fails, revise it, and repeat.” (Box, 1960; Blei, 2014)

Page 16: One Algorithm to Rule Them All: How to Automate Statistical Computation

What does this automatic tool need to do?

Page 17: One Algorithm to Rule Them All: How to Automate Statistical Computation

[Diagram: data + statistical model → machine learning expert → hidden patterns (many months later)]

Page 18: One Algorithm to Rule Them All: How to Automate Statistical Computation

[Diagram: data + statistical model → inference (maths) → inference (algorithm) → hidden patterns]

Page 19: One Algorithm to Rule Them All: How to Automate Statistical Computation

[Diagram: data + statistical model → inference (maths) → inference (algorithm) → hidden patterns]

[Graphical model: latent variables θ and data X]

Bayesian Model

model: p(X, θ) = p(X | θ) p(θ)
likelihood: p(X | θ)
prior: p(θ)

The model describes a data generating process.

The latent variables θ capture hidden patterns.

Page 20: One Algorithm to Rule Them All: How to Automate Statistical Computation

[Diagram: data + statistical model → inference (maths) → inference (algorithm) → hidden patterns]

[Graphical model: latent variables θ and data X]

Bayesian Model

model: p(X, θ) = p(X | θ) p(θ)
likelihood: p(X | θ)
prior: p(θ)

The model describes a data generating process.

The latent variables θ capture hidden patterns.
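
To make the notation concrete, here is a minimal sketch (not from the slides) of a model written as a log joint: a hypothetical Beta-Bernoulli model where the latent variable θ is a coin bias and X are observed flips.

import numpy as np
from scipy import stats

X = np.array([1, 0, 1, 1, 0, 1])   # hypothetical observed coin flips

def log_joint(theta):
    # log p(X, theta) = log p(X | theta) + log p(theta)
    log_lik = stats.bernoulli.logpmf(X, theta).sum()   # likelihood p(X | theta)
    log_prior = stats.beta.logpdf(theta, 2.0, 2.0)     # prior p(theta) = Beta(2, 2)
    return log_lik + log_prior

print(log_joint(0.6))   # log joint density at one value of the latent variable

Everything downstream (expectations, gradients, inference) only needs to evaluate a log joint like this one.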

Page 21: One Algorithm to Rule Them All: How to Automate Statistical Computation

[Diagram: data + statistical model → inference (maths) → inference (algorithm) → hidden patterns]

[Graphical model: latent variables θ and data X]

Bayesian Inference

posterior: p(θ | X) = p(X, θ) / ∫ p(X, θ) dθ

The posterior describes hidden patterns given data X.

It is typically intractable.

Page 22: One Algorithm to Rule Them All: How to Automate Statistical Computation

[Diagram: data + statistical model → inference (maths) → inference (algorithm) → hidden patterns]

[Graphical model: latent variables θ and data X]

Approximating the Posterior

Sampling: draw samples using MCMC.
Variational: approximate using a simple function.

The computations depend heavily on the model!

Page 23: One Algorithm to Rule Them All: How to Automate Statistical Computation

Common Statistical Computations

Expectations

E_q(θ;φ)[log p(X, θ)] = ∫ log p(X, θ) q(θ; φ) dθ

Gradients (of expectations)

∇_φ E_q(θ;φ)[log p(X, θ)]

Maximization (by following gradients)

max_φ E_q(θ;φ)[log p(X, θ)]
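
A compact sketch of all three computations on a hypothetical model (θ ∼ N(0,1), x_i ∼ N(θ,1)) with a fixed-variance Gaussian q(θ; φ) = N(φ, 1): Monte Carlo for the expectation, the score-function (log-derivative) estimator ∇_φ E_q[f(θ)] = E_q[f(θ) ∇_φ log q(θ; φ)] for the gradient, and plain gradient ascent for the maximization. All names, data, and step sizes are made up for illustration.

import numpy as np
from scipy import stats

np.random.seed(0)
X = np.random.normal(2.0, 1.0, size=50)          # synthetic data (hypothetical model)

def log_joint(theta):
    # theta ~ N(0, 1);  x_i | theta ~ N(theta, 1)
    return stats.norm.logpdf(X, theta, 1.0).sum() + stats.norm.logpdf(theta, 0.0, 1.0)

def expectation_and_grad(phi, S=2000):
    # Expectation: (1/S) sum_s log p(X, theta_s) with theta_s ~ q(theta; phi) = N(phi, 1).
    # Gradient: score-function estimator, using d/dphi log q(theta; phi) = theta - phi.
    thetas = np.random.normal(phi, 1.0, size=S)
    f = np.array([log_joint(t) for t in thetas])
    grad = ((f - f.mean()) * (thetas - phi)).mean()   # mean of f as a simple variance-reducing baseline
    return f.mean(), grad

# Maximization: follow the (noisy) gradient.
phi = 0.0
for _ in range(200):
    value, g = expectation_and_grad(phi)
    phi += 0.01 * g
print(phi)    # moves toward the posterior mean of theta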

Page 24: One Algorithm to Rule Them All: How to Automate Statistical Computation

Automating Expectations

Monte Carlo sampling

[Plot: a function f(θ) over the interval [a, a+1], with uniform draws θ^(s) marked and evaluated at f(θ^(s))]

∫_a^{a+1} f(θ) dθ ≈ (1/S) ∑_{s=1}^{S} f(θ^(s)), where θ^(s) ∼ Uniform(a, a+1)
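
A minimal numerical check of the idea, with a made-up integrand f(θ) = θ² and a = 0 (exact value 1/3):

import numpy as np

np.random.seed(0)
a, S = 0.0, 100000
f = lambda theta: theta ** 2                     # hypothetical integrand

theta = np.random.uniform(a, a + 1.0, size=S)    # theta^(s) ~ Uniform(a, a+1)
estimate = f(theta).mean()                       # (1/S) * sum_s f(theta^(s))
print(estimate)                                  # ≈ 1/3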

Page 25: One Algorithm to Rule Them All: How to Automate Statistical Computation

Automating Expectations

Monte Carlo sampling

E_q(θ;φ)[log p(X, θ)] = ∫ log p(X, θ) q(θ; φ) dθ ≈ (1/S) ∑_{s=1}^{S} log p(X, θ^(s)), where θ^(s) ∼ q(θ; φ)

Monte Carlo Statistical Methods, Robert and Casella, 1999

Monte Carlo and Quasi-Monte Carlo Sampling, Lemieux, 2009

Page 26: One Algorithm to Rule Them All: How to Automate Statistical Computation

Automating Expectations

Probability Distributions
- Stan, GSL (C++)
- NumPy, SciPy, edward (Python)
- built-in (R)
- Distributions.jl (Julia)
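
For example, with SciPy's distribution objects, drawing θ^(s) ∼ q(θ; φ) and evaluating log densities is a couple of calls (the parameter values here are arbitrary):

from scipy import stats

q = stats.norm(loc=1.0, scale=0.5)        # q(theta; phi), with phi = (mean 1.0, std 0.5)
theta = q.rvs(size=5, random_state=0)     # draw theta^(s) ~ q(theta; phi)
print(q.logpdf(theta))                    # evaluate log q(theta^(s); phi) at the draws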

Page 27: One Algorithm to Rule Them All: How to Automate Statistical Computation

Automating Gradients

Symbolic or Automatic Differentiation

Let f(x1, x2) = log x1 + x1 x2 − sin x2. Compute ∂f(2, 5)/∂x1.

From Table 2 of the survey: forward-mode AD of y = f(x1, x2) = ln(x1) + x1 x2 − sin(x2) at (x1, x2) = (2, 5), seeding ẋ1 = 1 to compute ∂y/∂x1. The forward evaluation trace is augmented by a forward derivative trace, where each derivative line supplements the original operation.

Forward evaluation trace:
v−1 = x1 = 2
v0 = x2 = 5
v1 = ln v−1 = ln 2 = 0.693
v2 = v−1 × v0 = 2 × 5 = 10
v3 = sin v0 = sin 5 = −0.959
v4 = v1 + v2 = 0.693 + 10 = 10.693
v5 = v4 − v3 = 10.693 + 0.959 = 11.652
y = v5 = 11.652

Forward derivative trace (v̇i = ∂vi/∂x1):
v̇−1 = ẋ1 = 1
v̇0 = ẋ2 = 0
v̇1 = v̇−1 / v−1 = 1/2 = 0.5
v̇2 = v̇−1 × v0 + v̇0 × v−1 = 1 × 5 + 0 × 2 = 5
v̇3 = v̇0 × cos v0 = 0 × cos 5 = 0
v̇4 = v̇1 + v̇2 = 0.5 + 5 = 5.5
v̇5 = v̇4 − v̇3 = 5.5 − 0 = 5.5
ẏ = v̇5 = 5.5

Applying the chain rule to each elementary operation in the forward evaluation trace generates the corresponding derivative trace; evaluating the vi together with their v̇i yields the required derivative in the final variable, v̇5 = ∂y/∂x1. This generalizes to the Jacobian of f : R^n → R^m with n inputs and m outputs: each forward pass is initialized with x = e_i (one input seeded to 1) and computes one column of the Jacobian J_f evaluated at the input point, so the full Jacobian takes n passes. Forward mode also gives an efficient, matrix-free way of computing Jacobian-vector products J_f r.

Automatic differentiation in machine learning: a survey, Baydin et al., 2015
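
Forward mode is simple enough to sketch directly. The following toy dual-number implementation (the class and helper names are made up for illustration) reproduces the trace above: f(2, 5) ≈ 11.652 and ∂f/∂x1 = 5.5.

import math

class Dual:
    # A (value, derivative) pair; the derivative is with respect to one chosen input.
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __sub__(self, other):
        return Dual(self.val - other.val, self.dot - other.dot)
    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def dlog(x):
    return Dual(math.log(x.val), x.dot / x.val)

def dsin(x):
    return Dual(math.sin(x.val), x.dot * math.cos(x.val))

# Seed dx1 = 1, dx2 = 0 to get the derivative with respect to x1.
x1, x2 = Dual(2.0, 1.0), Dual(5.0, 0.0)
y = dlog(x1) + x1 * x2 - dsin(x2)
print(y.val, y.dot)    # 11.652..., 5.5

Reverse mode (as in the Stan example on the next page) instead propagates adjoints backwards, which is the cheaper choice when there are many inputs and a single output, as in log-joint gradients.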

Page 28: One Algorithm to Rule Them All: How to Automate Statistical Computation

#include <stan/math.hpp>
#include <iostream>

int main() {
  using namespace std;

  // Independent variables as reverse-mode autodiff types.
  stan::math::var x1 = 2, x2 = 5;

  // Build the expression f = log(x1) + x1 * x2 - sin(x2).
  stan::math::var f = log(x1) + x1 * x2 - sin(x2);
  cout << "f(x1, x2) = " << f.val() << endl;

  // Propagate adjoints from f back to the inputs.
  f.grad();

  cout << "df/dx1 = " << x1.adj() << endl
       << "df/dx2 = " << x2.adj() << endl;

  return 0;
}

The Stan math library, Carpenter et al., 2015

Page 29: One Algorithm to Rule Them All: How to Automate Statistical Computation

Automating Gradients

Automatic Differentiation
- Stan, Adept, CppAD (C++)
- autograd, Tensorflow (Python)
- radx (R)
- http://www.juliadiff.org/ (Julia)

Symbolic Differentiation
- SymbolicC++ (C++)
- SymPy, Theano (Python)
- Deriv, Ryacas (R)
- http://www.juliadiff.org/ (Julia)
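
With a library such as autograd, the same derivative is one call (a small sketch, assuming autograd's grad(fun, argnum) interface):

import autograd.numpy as anp
from autograd import grad

def f(x1, x2):
    return anp.log(x1) + x1 * x2 - anp.sin(x2)

df_dx1 = grad(f, 0)         # derivative with respect to the first argument
print(f(2.0, 5.0))          # ≈ 11.652
print(df_dx1(2.0, 5.0))     # ≈ 5.5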

Page 30: One Algorithm to Rule Them All: How to Automate Statistical Computation

Stochastic Optimization
- Follow noisy unbiased gradients.

[Figure 8.8a from Murphy (2012): the LMS algorithm on a least-squares problem; the iterates start at θ = (−0.5, 2) and slowly converge to the least-squares solution θ = (1.45, 0.92), and the objective does not decrease monotonically over iterations.]

- Scale up by subsampling the data at each step.

Machine Learning: a Probabilistic Perspective, Murphy, 2012
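
A minimal sketch of the idea on a synthetic least-squares problem (data, step size, and iteration count are made up): each step follows the gradient computed from a single subsampled observation, which is noisy but unbiased.

import numpy as np

np.random.seed(0)
# Hypothetical least-squares problem: y = X @ w_true + noise.
N = 1000
X = np.random.randn(N, 2)
w_true = np.array([1.45, 0.92])
y = X @ w_true + 0.1 * np.random.randn(N)

w = np.array([-0.5, 2.0])                  # starting point, as in Figure 8.8
eta = 0.01                                 # step size
for k in range(5000):
    i = np.random.randint(N)               # subsample one observation per step
    grad = (X[i] @ w - y[i]) * X[i]        # noisy but unbiased gradient of the squared error
    w = w - eta * grad
print(w)                                   # ≈ w_true

With a full pass over the data this is ordinary gradient descent; the subsampling is what makes it scale.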

Page 31: One Algorithm to Rule Them All: How to Automate Statistical Computation

Stochastic Optimization

Generic Implementations
- Vowpal Wabbit, sgd (C++)
- Theano, Tensorflow (Python)
- sgd (R)
- SGDOptim.jl (Julia)

Page 32: One Algorithm to Rule Them All: How to Automate Statistical Computation

ADVI (Automatic Differentiation Variational Inference)

An easy-to-use, scalable, flexible algorithm

mc-stan.org

- Stan is a probabilistic programming system.
  1. Write the model in a simple language.
  2. Provide data.
  3. Run.
- RStan, PyStan, Stan.jl, ...
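
A rough sketch of that workflow, assuming PyStan 2.x, where StanModel.vb() runs ADVI; the model and data below are made-up examples, not ones from the talk.

import numpy as np
import pystan    # assumes PyStan 2.x, where StanModel.vb() runs ADVI

model_code = """
data {
  int<lower=0> N;
  vector[N] x;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 10);
  sigma ~ cauchy(0, 2.5);
  x ~ normal(mu, sigma);
}
"""

x = np.random.normal(1.0, 2.0, size=500)          # 2. provide data (synthetic here)
sm = pystan.StanModel(model_code=model_code)      # 1. write the model in Stan's language
results = sm.vb(data={"N": len(x), "x": x})       # 3. run ADVI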

Page 33: One Algorithm to Rule Them All: How to Automate Statistical Computation

How taxis navigate the city of Porto [1.7m trips] (K et al., 2016).

Page 34: One Algorithm to Rule Them All: How to Automate Statistical Computation

Exploring Taxi Rides

Data: 1.7 million taxi rides

- Write down a pPCA model. (∼minutes; see the sketch after this list)
- Use ADVI to infer subspace. (∼hours)
- Project data into pPCA subspace. (∼minutes)
- Write down a mixture model. (∼minutes)
- Use ADVI to find patterns. (∼minutes)
- Write down a supervised pPCA model. (∼minutes)
- Repeat. (∼hours)

- What would have taken us weeks → a single day.
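
For concreteness, a minimal sketch (not from the talk) of the probabilistic PCA log joint such a workflow starts from; the dimensions and priors are hypothetical.

import numpy as np
from scipy import stats

def ppca_log_joint(X, W, z, sigma):
    # pPCA: z_n ~ N(0, I_K),  x_n | z_n ~ N(W z_n, sigma^2 I_D),  with a N(0, 1) prior on W.
    lp = stats.norm.logpdf(W, 0.0, 1.0).sum()           # prior on the principal axes
    lp += stats.norm.logpdf(z, 0.0, 1.0).sum()          # prior on the latent coordinates
    lp += stats.norm.logpdf(X, z @ W.T, sigma).sum()    # likelihood of the observations
    return lp

# Hypothetical sizes: N rides, D features, K latent dimensions.
N, D, K = 500, 10, 2
rng = np.random.RandomState(0)
X = rng.randn(N, D)
print(ppca_log_joint(X, rng.randn(D, K), rng.randn(N, K), 1.0))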

Page 35: One Algorithm to Rule Them All: How to Automate Statistical Computation

[Diagram: data + statistical model → automatic tool → hidden patterns (instant), with a revise loop back to the model]

Monte Carlo Statistical Methods, Robert and Casella, 1999

Monte Carlo and Quasi-Monte Carlo Sampling, Lemieux, 2009

Automatic differentiation in machine learning: a survey, Baydin et al., 2015

The Stan math library, Carpenter et al., 2015

Machine Learning: a Probabilistic Perspective, Murphy, 2012

Automatic differentiation variational inference, K et al., 2016

proditus.com mc-stan.org Thank you!

Page 36: One Algorithm to Rule Them All: How to Automate Statistical Computation

EXTRA SLIDES

Page 37: One Algorithm to Rule Them All: How to Automate Statistical Computation
Page 38: One Algorithm to Rule Them All: How to Automate Statistical Computation
Page 39: One Algorithm to Rule Them All: How to Automate Statistical Computation

Kullback-Leibler Divergence

KL(q(θ) ‖ p(θ | X)) = ∫ q(θ) log [ q(θ) / p(θ | X) ] dθ
                    = E_q(θ)[ log ( q(θ) / p(θ | X) ) ]
                    = E_q(θ)[ log q(θ) − log p(θ | X) ]
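
A quick numerical illustration with two made-up Gaussians, comparing a Monte Carlo estimate of the KL divergence against its closed form:

import numpy as np
from scipy import stats

# Hypothetical example: q = N(0, 1), p = N(1, 2^2).
q, p = stats.norm(0.0, 1.0), stats.norm(1.0, 2.0)

# Monte Carlo estimate of KL(q || p) = E_q[log q(theta) - log p(theta)].
theta = q.rvs(size=200000, random_state=0)
kl_mc = np.mean(q.logpdf(theta) - p.logpdf(theta))

# Closed form for two univariate Gaussians, for comparison.
kl_exact = np.log(2.0 / 1.0) + (1.0**2 + (0.0 - 1.0)**2) / (2 * 2.0**2) - 0.5
print(kl_mc, kl_exact)    # both ≈ 0.443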

Page 40: One Algorithm to Rule Them All: How to Automate Statistical Computation

Related Objective Function

L(φ) = log p(X) − KL(q(θ) ‖ p(θ | X))
     = log p(X) − E_q(θ)[log q(θ) − log p(θ | X)]
     = log p(X) + E_q(θ)[log p(X, θ) − log p(X)] − E_q(θ)[log q(θ)]        (since log p(θ | X) = log p(X, θ) − log p(X))
     = E_q(θ)[log p(X, θ)] − E_q(θ)[log q(θ)]
     = E_q(θ;φ)[log p(X, θ)] − E_q(θ;φ)[log q(θ; φ)]
           (cross-entropy term)       (entropy term)
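
Putting the pieces together, a rough ADVI-flavored sketch (not the actual ADVI algorithm): a Gaussian q(θ; μ, ω) with the reparameterization θ = μ + e^ω ε, Monte Carlo estimates of the objective's gradient, and stochastic gradient ascent. The model (θ ∼ N(0,1), x_i ∼ N(θ,1)), the data, and all settings are hypothetical.

import numpy as np

np.random.seed(0)
X = np.random.normal(2.0, 1.0, size=100)   # synthetic data for theta ~ N(0,1), x_i ~ N(theta, 1)

def grad_log_joint(theta):
    # d/dtheta [ sum_i log N(x_i | theta, 1) + log N(theta | 0, 1) ]
    return (X - theta).sum() - theta

# q(theta; mu, omega) = N(mu, exp(omega)^2), reparameterized as theta = mu + exp(omega) * eps.
mu, omega, lr, S = 0.0, 0.0, 0.01, 10
for t in range(3000):
    eps = np.random.randn(S)
    theta = mu + np.exp(omega) * eps
    g = np.array([grad_log_joint(th) for th in theta])
    grad_mu = g.mean()                                    # d L / d mu
    grad_omega = (g * eps * np.exp(omega)).mean() + 1.0   # the entropy term contributes +1
    mu, omega = mu + lr * grad_mu, omega + lr * grad_omega

print(mu, np.exp(omega))   # roughly the posterior mean and standard deviation of theta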