One Algorithm to Rule Them All: How to Automate Statistical Computation
Can one algorithm rule them all?
How to automate statistical computations
Alp Kucukelbir
COLUMBIA UNIVERSITY
Can one algorithm rule them all?
Not yet. (But some tools can help!)
Rajesh Ranganath, Dustin Tran, Andrew Gelman, David Blei
Machine Learning
data → machine learning → hidden patterns
We want to discover and explore hidden patterns: to study hard-to-see connections, to predict future outcomes, to explore causal relationships.
How taxis navigate the city of Porto [1.7m trips] (K et al., 2016).
How do we use machine learning?
statistical model
data → machine learning expert → hidden patterns
(many months later)
Statistical Model
- Make assumptions about data.
- Capture uncertainties using probability.
Machine Learning Expert
- aka a PhD student.
Machine learning should be
1. Easy to use
2. Scalable
3. Flexible
statistical model
data → automatic tool → hidden patterns
(instant; revise and repeat)
"[Statistical] models are developed iteratively: we build a model, use it to analyze data, assess how it succeeds and fails, revise it, and repeat." (Box, 1960; Blei, 2014)
What does this automatic tool need to do?
statistical model
data → inference (maths) → inference (algorithm) → hidden patterns
Bayesian Model
X: data; θ: latent variables
model p(X, θ) = p(X | θ) p(θ)   [likelihood p(X | θ) × prior p(θ)]
The model describes a data generating process.
The latent variables θ capture hidden patterns.
Bayesian Inference
posterior p(θ | X) = p(X, θ) / ∫ p(X, θ) dθ
The posterior describes hidden patterns given data X.
It is typically intractable: the normalizing integral rarely has a closed form.
Approximating the Posterior
- Sampling: draw samples using MCMC.
- Variational: approximate using a simple function.
The computations depend heavily on the model!
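The sampling route can be sketched with a minimal random-walk Metropolis loop in Python (a toy illustration of the idea, not Stan's sampler; the target here is a standard normal, chosen so the answer is known):

```python
import math
import random

def metropolis(log_p, theta0, step, n_samples, rng):
    """Random-walk Metropolis: draw (correlated) samples from p(theta | X)."""
    theta = theta0
    log_p_theta = log_p(theta)
    samples = []
    for _ in range(n_samples):
        proposal = theta + step * rng.gauss(0.0, 1.0)
        log_p_prop = log_p(proposal)
        # Accept with probability min(1, p(proposal) / p(theta)).
        if math.log(rng.random()) < log_p_prop - log_p_theta:
            theta, log_p_theta = proposal, log_p_prop
        samples.append(theta)
    return samples

rng = random.Random(0)
# Toy "posterior": standard normal, known in closed form so we can sanity-check.
samples = metropolis(lambda t: -0.5 * t * t, theta0=0.0, step=1.0,
                     n_samples=20000, rng=rng)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Note the sampler only ever needs log p up to a constant, which is exactly why the intractable normalizer above is not a problem for MCMC.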
Common Statistical Computations
Expectations
E_q(θ;φ)[log p(X, θ)] = ∫ log p(X, θ) q(θ; φ) dθ
Gradients (of expectations)
∇φ E_q(θ;φ)[log p(X, θ)]
Maximization (by following gradients)
max_φ E_q(θ;φ)[log p(X, θ)]
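Gradients of expectations can themselves be estimated by Monte Carlo via the score-function identity ∇φ E_q[f(θ)] = E_q[f(θ) ∇φ log q(θ; φ)]. A Python sketch with toy choices (q = Normal(μ, 1) and f(θ) = θ, so the true gradient with respect to μ is exactly 1):

```python
import random

rng = random.Random(1)
mu, S = 0.0, 100000

# Score function of Normal(mu, 1): d/dmu log q(theta; mu) = theta - mu.
# Estimator: (1/S) * sum over samples of f(theta) * score(theta), theta ~ q.
est = 0.0
for _ in range(S):
    theta = rng.gauss(mu, 1.0)
    est += theta * (theta - mu)   # f(theta) = theta
est /= S
```

This estimator only requires sampling from q and differentiating log q, not f, which is what makes gradient-of-expectation computations automatable.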
Automating Expectations: Monte Carlo sampling
[Figure: f(θ) over the interval [a, a+1], with sampled values f(θ^(s)) marked.]
∫_a^{a+1} f(θ) dθ ≈ (1/S) Σ_{s=1}^S f(θ^(s)),   where θ^(s) ∼ Uniform(a, a+1)
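This uniform-sampling estimate fits in a few lines of Python (f = sin and a = 0 chosen here so the exact answer, 1 − cos 1, is available for comparison):

```python
import math
import random

rng = random.Random(0)
a, S = 0.0, 100000

# The integral over [a, a+1] equals the expectation under Uniform(a, a+1).
estimate = sum(math.sin(a + rng.random()) for _ in range(S)) / S
exact = math.cos(a) - math.cos(a + 1.0)  # antiderivative of sin is -cos
```

The error shrinks like O(1/√S) regardless of the dimension of θ, which is why the same recipe scales to the high-dimensional expectations below.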
Automating Expectations: Monte Carlo sampling
E_q(θ;φ)[log p(X, θ)] = ∫ log p(X, θ) q(θ; φ) dθ
                      ≈ (1/S) Σ_{s=1}^S log p(X, θ^(s)),   where θ^(s) ∼ q(θ; φ)
Monte Carlo Statistical Methods, Robert and Casella, 1999
Monte Carlo and Quasi-Monte Carlo Sampling, Lemieux, 2009
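The same recipe applies with draws from q(θ; φ): a toy numerical check in Python (model and numbers invented for illustration), with a single observation x = 1 under p(x | θ) = Normal(θ, 1), prior p(θ) = Normal(0, 1), and q = Normal(0, 1), where the expectation has a closed form to compare against:

```python
import math
import random

rng = random.Random(0)
S = 100000
x = 1.0  # a single observed data point

def log_joint(x, theta):
    # log p(x, theta) = log N(x; theta, 1) + log N(theta; 0, 1)
    return (-0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi)
            - 0.5 * theta ** 2 - 0.5 * math.log(2 * math.pi))

# q(theta; phi) = Normal(0, 1); estimate E_q[log p(x, theta)] by sampling q.
estimate = sum(log_joint(x, rng.gauss(0.0, 1.0)) for _ in range(S)) / S
exact = -1.5 - math.log(2 * math.pi)  # closed form for this toy model
```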
Automating Expectations: Probability Distributions
- Stan, GSL (C++)
- NumPy, SciPy, edward (Python)
- built-in (R)
- Distributions.jl (Julia)
Automating Gradients: Symbolic or Automatic Differentiation
Let f(x1, x2) = log x1 + x1 x2 − sin x2. Compute ∂f(2, 5)/∂x1.
Forward-mode AD (Table 2 of Baydin et al.): evaluate y = f(x1, x2) at (x1, x2) = (2, 5), seeding ẋ1 = 1 and ẋ2 = 0 to compute ∂y/∂x1. Each line of the derivative trace supplements the corresponding line of the evaluation trace.
Forward Evaluation Trace
v−1 = x1        = 2
v0  = x2        = 5
v1  = ln v−1    = ln 2 = 0.693
v2  = v−1 × v0  = 2 × 5 = 10
v3  = sin v0    = sin 5 = −0.959
v4  = v1 + v2   = 0.693 + 10 = 10.693
v5  = v4 − v3   = 10.693 + 0.959 = 11.652
y   = v5        = 11.652
Forward Derivative Trace
v̇−1 = ẋ1                    = 1
v̇0  = ẋ2                    = 0
v̇1  = v̇−1 / v−1             = 1/2
v̇2  = v̇−1 × v0 + v̇0 × v−1  = 1 × 5 + 0 × 2 = 5
v̇3  = v̇0 × cos v0           = 0 × cos 5 = 0
v̇4  = v̇1 + v̇2              = 0.5 + 5 = 5.5
v̇5  = v̇4 − v̇3              = 5.5 − 0 = 5.5
ẏ   = v̇5                    = 5.5
Applying the chain rule to each elementary operation in the forward evaluation trace generates the corresponding derivative trace; evaluating each vi together with its v̇i yields the required derivative in the final variable, v̇5 = ∂y/∂x1.
This generalizes naturally to the Jacobian of a function f: ℝⁿ → ℝᵐ with n independent variables xi and m dependent variables yj. Each forward pass is initialized by setting x = e_i, the i-th unit vector; a run with input x = a then computes ẏj = ∂yj/∂xi evaluated at a, i.e. one column of the Jacobian J_f. The full Jacobian thus takes n forward passes. Furthermore, forward mode provides a very efficient, matrix-free way of computing Jacobian-vector products J_f r.
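The forward traces above can be reproduced mechanically with dual numbers, where every intermediate value carries its own derivative. A minimal Python sketch for this one function (a toy, not how Stan or any production AD system is implemented):

```python
import math

class Dual:
    """A value paired with its derivative w.r.t. one chosen input."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __sub__(self, other):
        return Dual(self.val - other.val, self.dot - other.dot)
    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def dlog(a):
    return Dual(math.log(a.val), a.dot / a.val)

def dsin(a):
    return Dual(math.sin(a.val), a.dot * math.cos(a.val))

def f(x1, x2):
    return dlog(x1) + x1 * x2 - dsin(x2)

# Seed x1 with derivative 1 (and x2 with 0) to get dy/dx1 at (2, 5).
y = f(Dual(2.0, 1.0), Dual(5.0, 0.0))
```

Running the same function with the seeds swapped would give ∂y/∂x2, one forward pass per input, exactly as in the Jacobian discussion above.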
Automatic differentiation in machine learning: a survey, Baydin et al., 2015
#include <stan/math.hpp>

int main() {
  using namespace std;
  stan::math::var x1 = 2, x2 = 5;
  stan::math::var f;
  f = log(x1) + x1 * x2 - sin(x2);
  cout << "f(x1, x2) = " << f.val() << endl;

  f.grad();
  cout << "df/dx1 = " << x1.adj() << endl
       << "df/dx2 = " << x2.adj() << endl;
  return 0;
}
The Stan math library, Carpenter et al., 2015
Automating Gradients: Automatic Differentiation
- Stan, Adept, CppAD (C++)
- autograd, Tensorflow (Python)
- radx (R)
- http://www.juliadiff.org/ (Julia)
Symbolic Differentiation
- SymbolicC++ (C++)
- SymPy, Theano (Python)
- Deriv, Ryacas (R)
- http://www.juliadiff.org/ (Julia)
Stochastic Optimization
- Follow noisy unbiased gradients.
[Figure 8.8 from Murphy, 2012: the LMS algorithm on a least-squares problem. (a) The trajectory from θ = (−0.5, 2) to the least-squares solution θ = (1.45, 0.92). (b) The objective (RSS) vs. iteration; it does not decrease monotonically.]
θ_{k+1} = θ_k − η_k (ŷ_k − y_k) x_k
This is the least mean squares (LMS) algorithm, also known as the delta rule or the Widrow-Hoff rule: the per-example gradient is the feature vector x_k weighted by the prediction error, and each step follows it.
- Scale up by subsampling the data at each step.
Machine Learning: a Probabilistic Perspective, Murphy, 2012
Stochastic Optimization: Generic Implementations
- Vowpal Wabbit, sgd (C++)
- Theano, Tensorflow (Python)
- sgd (R)
- SGDOptim.jl (Julia)
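The LMS update itself fits in a few lines of Python. A toy streaming sketch (synthetic data y = 3x, invented for illustration), following one noisy per-example gradient per step:

```python
import random

rng = random.Random(0)
true_theta = 3.0
theta = 0.0   # initial guess
eta = 0.1     # step size

# Stream one (x, y) pair per step; each per-example gradient is a noisy
# but unbiased estimate of the full least-squares gradient.
for _ in range(2000):
    x = rng.random()
    y = true_theta * x
    prediction = theta * x
    theta -= eta * (prediction - y) * x   # the LMS / delta-rule update
```

Each step touches a single example, so the cost per update is independent of the dataset size; that is the "scale up by subsampling" point above.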
ADVI (Automatic Differentiation Variational Inference)
An easy-to-use, scalable, flexible algorithm
mc-stan.org
Stan is a probabilistic programming system.
1. Write the model in a simple language.
2. Provide data.
3. Run.
Interfaces: RStan, PyStan, Stan.jl, ...
How taxis navigate the city of Porto [1.7m trips] (K et al., 2016).
Exploring Taxi Rides
Data: 1.7 million taxi rides
- Write down a pPCA model. (∼minutes)
- Use ADVI to infer the subspace. (∼hours)
- Project data into the pPCA subspace. (∼minutes)
- Write down a mixture model. (∼minutes)
- Use ADVI to find patterns. (∼minutes)
- Write down a supervised pPCA model. (∼minutes)
- Repeat. (∼hours)
What would have taken weeks took a single day.
statistical model
data → automatic tool → hidden patterns
(instant; revise and repeat)
Monte Carlo Statistical Methods, Robert and Casella, 1999
Monte Carlo and Quasi-Monte Carlo Sampling, Lemieux, 2009
Automatic differentiation in machine learning: a survey, Baydin et al., 2015
The Stan math library, Carpenter et al., 2015
Machine Learning: a Probabilistic Perspective, Murphy, 2012
Automatic differentiation variational inference, K et al., 2016
proditus.com mc-stan.org Thank you!
EXTRA SLIDES
Kullback-Leibler Divergence
KL(q(θ) ‖ p(θ | X)) = ∫ q(θ) log [ q(θ) / p(θ | X) ] dθ
                    = E_q(θ)[ log ( q(θ) / p(θ | X) ) ]
                    = E_q(θ)[ log q(θ) − log p(θ | X) ]
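A quick numerical sanity check of the last identity (KL as an expectation under q) on a small discrete example, with numbers invented for illustration:

```python
import math

# Two discrete distributions over three outcomes.
q = [0.5, 0.3, 0.2]
p = [0.2, 0.5, 0.3]

# Definition: sum over theta of q(theta) * log(q(theta) / p(theta)).
kl_direct = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

# The same quantity as E_q[log q - log p].
kl_expect = sum(qi * (math.log(qi) - math.log(pi)) for qi, pi in zip(q, p))
```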
Related Objective Function
L(φ) = log p(X) − KL(q(θ) ‖ p(θ | X))
     = log p(X) − E_q(θ)[log q(θ) − log p(θ | X)]
     = log p(X) + E_q(θ)[log p(θ | X)] − E_q(θ)[log q(θ)]
     = E_q(θ)[log p(X, θ)] − E_q(θ)[log q(θ)]     [using log p(θ | X) = log p(X, θ) − log p(X)]
     = E_q(θ;φ)[log p(X, θ)] − E_q(θ;φ)[log q(θ; φ)]
The first term is the (negative) cross-entropy; the second, −E_q(θ;φ)[log q(θ; φ)], is the entropy of q.
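The decomposition L(φ) = log p(X) − KL(q(θ) ‖ p(θ | X)) can be checked numerically on a discrete toy model (all numbers invented for illustration):

```python
import math

# A toy joint p(X, theta) for a fixed observed X, over three values of theta.
joint = [0.10, 0.25, 0.15]             # p(X, theta) for theta = 0, 1, 2
evidence = sum(joint)                  # p(X) = sum over theta of p(X, theta)
posterior = [j / evidence for j in joint]

q = [0.3, 0.4, 0.3]                    # an arbitrary variational distribution

# ELBO = E_q[log p(X, theta)] - E_q[log q(theta)]
elbo = sum(qi * (math.log(ji) - math.log(qi)) for qi, ji in zip(q, joint))

# log p(X) - KL(q || p(theta | X))
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, posterior))
check = math.log(evidence) - kl
```

Since KL ≥ 0, the ELBO lower-bounds log p(X), which is why maximizing L(φ) over φ tightens the approximation to the posterior.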