One Algorithm to Rule Them All: How to Automate Statistical Computation
Can one algorithm rule them all?
How to automate statistical computations
Alp Kucukelbir
COLUMBIA UNIVERSITY
Can one algorithm rule them all?
Not yet. (But some tools can help!)
Rajesh Ranganath, Dustin Tran, Andrew Gelman, David Blei
Machine Learning
data → machine learning → hidden patterns
We want to discover and explore hidden patterns: to study hard-to-see connections, to predict future outcomes, to explore causal relationships.
How taxis navigate the city of Porto [1.7m trips] (K et al., 2016).
How do we use machine learning?
statistical model
data → machine learning expert → hidden patterns
(many months later)
Statistical Model
- Make assumptions about data.
- Capture uncertainties using probability.
Machine Learning Expert
- aka a PhD student.
Machine learning should be
1. Easy to use
2. Scalable
3. Flexible
statistical model
data → automatic tool → hidden patterns
(instant; revise and repeat)
"[Statistical] models are developed iteratively: we build a model, use it to analyze data, assess how it succeeds and fails, revise it, and repeat." (Box, 1960; Blei, 2014)
What does this automatic tool need to do?
statistical model
data → inference (maths) → inference (algorithm) → hidden patterns
Bayesian Model
X: data; θ: latent variables
model p(X, θ) = p(X | θ) p(θ)   [likelihood p(X | θ) × prior p(θ)]
The model describes a data generating process.
The latent variables θ capture hidden patterns.
Bayesian Inference
posterior p(θ | X) = p(X, θ) / ∫ p(X, θ) dθ
The posterior describes hidden patterns given data X.
It is typically intractable: the normalizing integral rarely has a closed form.
Approximating the Posterior
- Sampling: draw samples using MCMC.
- Variational: approximate using a simple function.
The computations depend heavily on the model!
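The sampling route can be sketched with a minimal random-walk Metropolis loop in Python (a toy illustration of the idea, not Stan's sampler; the target here is a standard normal, chosen so the answer is known):

```python
import math
import random

def metropolis(log_p, theta0, step, n_samples, rng):
    """Random-walk Metropolis: draw (correlated) samples from p(theta | X)."""
    theta = theta0
    log_p_theta = log_p(theta)
    samples = []
    for _ in range(n_samples):
        proposal = theta + step * rng.gauss(0.0, 1.0)
        log_p_prop = log_p(proposal)
        # Accept with probability min(1, p(proposal) / p(theta)).
        if math.log(rng.random()) < log_p_prop - log_p_theta:
            theta, log_p_theta = proposal, log_p_prop
        samples.append(theta)
    return samples

rng = random.Random(0)
# Toy "posterior": standard normal, known in closed form so we can sanity-check.
samples = metropolis(lambda t: -0.5 * t * t, theta0=0.0, step=1.0,
                     n_samples=20000, rng=rng)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Note the sampler only ever needs log p up to a constant, which is exactly why the intractable normalizer above is not a problem for MCMC.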
Common Statistical Computations
Expectations
E_q(θ;φ)[log p(X, θ)] = ∫ log p(X, θ) q(θ; φ) dθ
Gradients (of expectations)
∇φ E_q(θ;φ)[log p(X, θ)]
Maximization (by following gradients)
max_φ E_q(θ;φ)[log p(X, θ)]
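Gradients of expectations can themselves be estimated by Monte Carlo via the score-function identity ∇φ E_q[f(θ)] = E_q[f(θ) ∇φ log q(θ; φ)]. A Python sketch with toy choices (q = Normal(μ, 1) and f(θ) = θ, so the true gradient with respect to μ is exactly 1):

```python
import random

rng = random.Random(1)
mu, S = 0.0, 100000

# Score function of Normal(mu, 1): d/dmu log q(theta; mu) = theta - mu.
# Estimator: (1/S) * sum over samples of f(theta) * score(theta), theta ~ q.
est = 0.0
for _ in range(S):
    theta = rng.gauss(mu, 1.0)
    est += theta * (theta - mu)   # f(theta) = theta
est /= S
```

This estimator only requires sampling from q and differentiating log q, not f, which is what makes gradient-of-expectation computations automatable.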
Automating Expectations: Monte Carlo sampling
[Figure: f(θ) over the interval [a, a+1], with sampled values f(θ^(s)) marked.]
∫_a^{a+1} f(θ) dθ ≈ (1/S) Σ_{s=1}^S f(θ^(s)),   where θ^(s) ∼ Uniform(a, a+1)
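This uniform-sampling estimate fits in a few lines of Python (f = sin and a = 0 chosen here so the exact answer, 1 − cos 1, is available for comparison):

```python
import math
import random

rng = random.Random(0)
a, S = 0.0, 100000

# The integral over [a, a+1] equals the expectation under Uniform(a, a+1).
estimate = sum(math.sin(a + rng.random()) for _ in range(S)) / S
exact = math.cos(a) - math.cos(a + 1.0)  # antiderivative of sin is -cos
```

The error shrinks like O(1/√S) regardless of the dimension of θ, which is why the same recipe scales to the high-dimensional expectations below.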
Automating Expectations: Monte Carlo sampling
E_q(θ;φ)[log p(X, θ)] = ∫ log p(X, θ) q(θ; φ) dθ
                      ≈ (1/S) Σ_{s=1}^S log p(X, θ^(s)),   where θ^(s) ∼ q(θ; φ)
Monte Carlo Statistical Methods, Robert and Casella, 1999
Monte Carlo and Quasi-Monte Carlo Sampling, Lemieux, 2009
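The same recipe applies with draws from q(θ; φ): a toy numerical check in Python (model and numbers invented for illustration), with a single observation x = 1 under p(x | θ) = Normal(θ, 1), prior p(θ) = Normal(0, 1), and q = Normal(0, 1), where the expectation has a closed form to compare against:

```python
import math
import random

rng = random.Random(0)
S = 100000
x = 1.0  # a single observed data point

def log_joint(x, theta):
    # log p(x, theta) = log N(x; theta, 1) + log N(theta; 0, 1)
    return (-0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi)
            - 0.5 * theta ** 2 - 0.5 * math.log(2 * math.pi))

# q(theta; phi) = Normal(0, 1); estimate E_q[log p(x, theta)] by sampling q.
estimate = sum(log_joint(x, rng.gauss(0.0, 1.0)) for _ in range(S)) / S
exact = -1.5 - math.log(2 * math.pi)  # closed form for this toy model
```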
Automating Expectations: Probability Distributions
- Stan, GSL (C++)
- NumPy, SciPy, edward (Python)
- built-in (R)
- Distributions.jl (Julia)
Automating Gradients: Symbolic or Automatic Differentiation
Let f(x1, x2) = log x1 + x1 x2 − sin x2. Compute ∂f(2, 5)/∂x1.
Forward-mode AD (Table 2 of Baydin et al.): evaluate y = f(x1, x2) at (x1, x2) = (2, 5), seeding ẋ1 = 1 and ẋ2 = 0 to compute ∂y/∂x1. Each line of the derivative trace supplements the corresponding line of the evaluation trace.
Forward Evaluation Trace
v−1 = x1        = 2
v0  = x2        = 5
v1  = ln v−1    = ln 2 = 0.693
v2  = v−1 × v0  = 2 × 5 = 10
v3  = sin v0    = sin 5 = −0.959
v4  = v1 + v2   = 0.693 + 10 = 10.693
v5  = v4 − v3   = 10.693 + 0.959 = 11.652
y   = v5        = 11.652
Forward Derivative Trace
v̇−1 = ẋ1                    = 1
v̇0  = ẋ2                    = 0
v̇1  = v̇−1 / v−1             = 1/2
v̇2  = v̇−1 × v0 + v̇0 × v−1  = 1 × 5 + 0 × 2 = 5
v̇3  = v̇0 × cos v0           = 0 × cos 5 = 0
v̇4  = v̇1 + v̇2              = 0.5 + 5 = 5.5
v̇5  = v̇4 − v̇3              = 5.5 − 0 = 5.5
ẏ   = v̇5                    = 5.5
Applying the chain rule to each elementary operation in the forward evaluation trace generates the corresponding derivative trace; evaluating each vi together with its v̇i yields the required derivative in the final variable, v̇5 = ∂y/∂x1.
This generalizes naturally to the Jacobian of a function f: ℝⁿ → ℝᵐ with n independent variables xi and m dependent variables yj. Each forward pass is initialized by setting x = e_i, the i-th unit vector; a run with input x = a then computes ẏj = ∂yj/∂xi evaluated at a, i.e. one column of the Jacobian J_f. The full Jacobian thus takes n forward passes. Furthermore, forward mode provides a very efficient, matrix-free way of computing Jacobian-vector products J_f r.
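The forward traces above can be reproduced mechanically with dual numbers, where every intermediate value carries its own derivative. A minimal Python sketch for this one function (a toy, not how Stan or any production AD system is implemented):

```python
import math

class Dual:
    """A value paired with its derivative w.r.t. one chosen input."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __sub__(self, other):
        return Dual(self.val - other.val, self.dot - other.dot)
    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def dlog(a):
    return Dual(math.log(a.val), a.dot / a.val)

def dsin(a):
    return Dual(math.sin(a.val), a.dot * math.cos(a.val))

def f(x1, x2):
    return dlog(x1) + x1 * x2 - dsin(x2)

# Seed x1 with derivative 1 (and x2 with 0) to get dy/dx1 at (2, 5).
y = f(Dual(2.0, 1.0), Dual(5.0, 0.0))
```

Running the same function with the seeds swapped would give ∂y/∂x2, one forward pass per input, exactly as in the Jacobian discussion above.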
Automatic differentiation in machine learning: a survey, Baydin et al., 2015
#include <stan/math.hpp>

int main() {
  using namespace std;
  stan::math::var x1 = 2, x2 = 5;
  stan::math::var f;
  f = log(x1) + x1 * x2 - sin(x2);
  cout << "f(x1, x2) = " << f.val() << endl;

  f.grad();
  cout << "df/dx1 = " << x1.adj() << endl
       << "df/dx2 = " << x2.adj() << endl;
  return 0;
}
The Stan math library, Carpenter et al., 2015
Automating Gradients: Automatic Differentiation
- Stan, Adept, CppAD (C++)
- autograd, Tensorflow (Python)
- radx (R)
- http://www.juliadiff.org/ (Julia)
Symbolic Differentiation
- SymbolicC++ (C++)
- SymPy, Theano (Python)
- Deriv, Ryacas (R)
- http://www.juliadiff.org/ (Julia)
Stochastic Optimization
- Follow noisy unbiased gradients.
[Figure 8.8 from Murphy, 2012: the LMS algorithm on a least-squares problem. (a) The trajectory from θ = (−0.5, 2) to the least-squares solution θ = (1.45, 0.92). (b) The objective (RSS) vs. iteration; it does not decrease monotonically.]
θ_{k+1} = θ_k − η_k (ŷ_k − y_k) x_k
This is the least mean squares (LMS) algorithm, also known as the delta rule or the Widrow-Hoff rule: the per-example gradient is the feature vector x_k weighted by the prediction error, and each step follows it.
- Scale up by subsampling the data at each step.
Machine Learning: a Probabilistic Perspective, Murphy, 2012
Stochastic Optimization: Generic Implementations
- Vowpal Wabbit, sgd (C++)
- Theano, Tensorflow (Python)
- sgd (R)
- SGDOptim.jl (Julia)
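The LMS update itself fits in a few lines of Python. A toy streaming sketch (synthetic data y = 3x, invented for illustration), following one noisy per-example gradient per step:

```python
import random

rng = random.Random(0)
true_theta = 3.0
theta = 0.0   # initial guess
eta = 0.1     # step size

# Stream one (x, y) pair per step; each per-example gradient is a noisy
# but unbiased estimate of the full least-squares gradient.
for _ in range(2000):
    x = rng.random()
    y = true_theta * x
    prediction = theta * x
    theta -= eta * (prediction - y) * x   # the LMS / delta-rule update
```

Each step touches a single example, so the cost per update is independent of the dataset size; that is the "scale up by subsampling" point above.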
ADVI (Automatic Differentiation Variational Inference)
An easy-to-use, scalable, flexible algorithm
mc-stan.org
Stan is a probabilistic programming system.
1. Write the model in a simple language.
2. Provide data.
3. Run.
Interfaces: RStan, PyStan, Stan.jl, ...
How taxis navigate the city of Porto [1.7m trips] (K et al., 2016).
Exploring Taxi Rides
Data: 1.7 million taxi rides
- Write down a pPCA model. (∼minutes)
- Use ADVI to infer the subspace. (∼hours)
- Project data into the pPCA subspace. (∼minutes)
- Write down a mixture model. (∼minutes)
- Use ADVI to find patterns. (∼minutes)
- Write down a supervised pPCA model. (∼minutes)
- Repeat. (∼hours)
What would have taken weeks took a single day.
statistical model
data → automatic tool → hidden patterns
(instant; revise and repeat)
Monte Carlo Statistical Methods, Robert and Casella, 1999
Monte Carlo and Quasi-Monte Carlo Sampling, Lemieux, 2009
Automatic differentiation in machine learning: a survey, Baydin et al., 2015
The Stan math library, Carpenter et al., 2015
Machine Learning: a Probabilistic Perspective, Murphy, 2012
Automatic differentiation variational inference, K et al., 2016
proditus.com mc-stan.org Thank you!
EXTRA SLIDES
Kullback-Leibler Divergence
KL(q(θ) ‖ p(θ | X)) = ∫ q(θ) log [ q(θ) / p(θ | X) ] dθ
                    = E_q(θ)[ log ( q(θ) / p(θ | X) ) ]
                    = E_q(θ)[ log q(θ) − log p(θ | X) ]
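A quick numerical sanity check of the last identity (KL as an expectation under q) on a small discrete example, with numbers invented for illustration:

```python
import math

# Two discrete distributions over three outcomes.
q = [0.5, 0.3, 0.2]
p = [0.2, 0.5, 0.3]

# Definition: sum over theta of q(theta) * log(q(theta) / p(theta)).
kl_direct = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

# The same quantity as E_q[log q - log p].
kl_expect = sum(qi * (math.log(qi) - math.log(pi)) for qi, pi in zip(q, p))
```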
Related Objective Function
L(φ) = log p(X) − KL(q(θ) ‖ p(θ | X))
     = log p(X) − E_q(θ)[log q(θ) − log p(θ | X)]
     = log p(X) + E_q(θ)[log p(θ | X)] − E_q(θ)[log q(θ)]
     = E_q(θ)[log p(X, θ)] − E_q(θ)[log q(θ)]     [using log p(θ | X) = log p(X, θ) − log p(X)]
     = E_q(θ;φ)[log p(X, θ)] − E_q(θ;φ)[log q(θ; φ)]
The first term is the (negative) cross-entropy; the second, −E_q(θ;φ)[log q(θ; φ)], is the entropy of q.
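The decomposition L(φ) = log p(X) − KL(q(θ) ‖ p(θ | X)) can be checked numerically on a discrete toy model (all numbers invented for illustration):

```python
import math

# A toy joint p(X, theta) for a fixed observed X, over three values of theta.
joint = [0.10, 0.25, 0.15]             # p(X, theta) for theta = 0, 1, 2
evidence = sum(joint)                  # p(X) = sum over theta of p(X, theta)
posterior = [j / evidence for j in joint]

q = [0.3, 0.4, 0.3]                    # an arbitrary variational distribution

# ELBO = E_q[log p(X, theta)] - E_q[log q(theta)]
elbo = sum(qi * (math.log(ji) - math.log(qi)) for qi, ji in zip(q, joint))

# log p(X) - KL(q || p(theta | X))
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, posterior))
check = math.log(evidence) - kl
```

Since KL ≥ 0, the ELBO lower-bounds log p(X), which is why maximizing L(φ) over φ tightens the approximation to the posterior.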