
Deep Exponential Families

Rajesh Ranganath Linpeng Tang Laurent Charlin David M. Blei

Princeton University Princeton University Columbia University Columbia University

{rajeshr,linpengt}@cs.princeton.edu  {lcharlin,blei}@cs.columbia.edu

Abstract

We describe deep exponential families

(DEFs), a class of latent variable models that are inspired by the hidden structures used in deep neural networks. DEFs capture a hierarchy of dependencies between latent variables, and are easily generalized to many settings through exponential families. We perform inference using recent "black box" variational inference techniques. We then evaluate various DEFs on text and combine multiple DEFs into a model for pairwise recommendation data. In an extensive study, we show going beyond one layer improves predictions for DEFs. We demonstrate that DEFs find interesting exploratory structure in large data sets, and give better predictive performance than state-of-the-art models.

1 Introduction

In this paper we develop deep exponential families (DEFs), a flexible family of probability distributions that reflect the intuitions behind deep unsupervised feature learning. In a DEF, observations arise from a cascade of layers of latent variables. Each layer's variables are drawn from an exponential family that is governed by the inner product of the previous layer's variables and a set of weights.

As in deep unsupervised feature learning, a DEF represents hidden patterns, from coarse to fine grained, that compose with each other to form the observations. DEFs also enjoy the advantages of probabilistic modeling. Through their connection to exponential families [7], they support many kinds of data. Overall, DEFs combine the powerful representations of deep networks with the flexibility of graphical models.

Appearing in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors.

Consider the problem of modeling documents. We can represent a document as a vector of term counts modeled with Poisson random variables [9]. In one type of DEF, the rate of each term's Poisson count is an inner product of a layer of latent variables (one level up from the terms) and a set of weights that are shared across documents. Loosely, we can think of the latent layer above the observations as per-document "topic" activations, each of which ignites a set of related terms via their inner product with the weights. These latent topics are, in turn, modeled in a similar way, conditioned on a layer above of "super topics." Just as the topics group related terms, the super topics group related topics, again via the inner product.

Figure 1 illustrates an example of a three level DEF uncovered from a large set of articles in The New York Times. (This style of model, though with different details, has been previously studied in the topic modeling literature [21].) Conditional on the word counts of the articles, the DEF defines a posterior distribution of the per-document cascades of latent variables and the layers of weights. Here we have visualized two third-layer topics which correspond to the concepts of "Government" and "Politics". We focus on "Government" and notice that the model has discovered, through its second-layer super-topics, the three branches of government: judiciary (left), legislative (center) and executive (right).

This is just one example. In a DEF, the latent variables can be from any exponential family: Bernoulli latent variables recover the classical sigmoid belief network [25]; gamma latent variables give something akin to a deep version of nonnegative matrix factorization [20]; Gaussian latent variables lead to the types of models that have recently been explored in the context


of computer vision [28]. DEFs fall into the broad class of stochastic feed forward networks defined by Neal [25]. These networks differ from the undirected deep probabilistic models [30, 37] in that they allow for explaining away, where latent variables compete to explain the observations.

In addition to varying types of latent variables, we can further change the prior on the weights and the observation model. Observations can be real valued, such as those from music and images, binary, such as those in the sigmoid belief network, or multinomial, such as when modeling text. In the language of neural networks, the prior on the weights amounts to choosing a type of regularization; the observation model amounts to choosing a type of loss.

Finally, we can embed the DEF in a more complex model, building "deep" versions of traditional models from the statistics and machine learning research literature. As examples, the DEF can be made part of a multi-level model of grouped data [10], a time-series model of sequential data [3], or a factorization model of pairwise data [31]. As a concrete example, we will develop and study the double DEF. The double DEF models a matrix of pairwise observations, such as users rating items. It uses two DEFs, one for the latent representation of users and the other for items. The observation of each user/item interaction combines the lowest layer of their individual DEF representations.

In the rest of this paper, we will define, develop, and study deep exponential families. We will explain some of their properties and situate them in the larger contexts of probabilistic models and deep neural networks. We will then develop generic variational inference algorithms for using DEFs. We will show how to use them to solve real-world problems with large data sets, and we will extensively study many DEFs on the problems of document modeling and collaborative filtering. We show that DEF-variants of existing "shallow" models give more interesting exploratory structure and better predictive performance. More generally, DEFs are a flexible class of models which, along with our algorithms for computing with them, let us easily explore a rich landscape of solutions for modern data analysis problems.

2 Deep Exponential Families

In this section we review exponential families and present deep exponential families.

Exponential families. Exponential families [7] are an important class of distributions with convenient mathematical properties. Their form is

p(x) = h(x) exp(η^T T(x) − a(η)),

where h is the base measure, η are the natural parameters, T are the sufficient statistics, and a is the log-normalizer. The expectation of the sufficient statistics of an exponential family is the gradient of the log-normalizer, E[T(x)] = ∇_η a(η). Exponential families are completely specified by their sufficient statistics and base measure; different choices of h and T lead to different distributions. For example, in the normal distribution the base measure is h = 1/√(2π) and the sufficient statistics are T(x) = [x, x²]; for the Beta distribution, a distribution with support over (0, 1), the base measure is h = 1 and the sufficient statistics are T(x) = [log x, log(1 − x)].
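As a worked example (added here for illustration, not part of the original text), the Bernoulli distribution can be written in this form, and differentiating its log-normalizer recovers its mean, an identity used repeatedly below:

```latex
% Bernoulli with success probability \pi, written in exponential-family form:
% h(x) = 1, sufficient statistic T(x) = x, natural parameter \eta = \log\frac{\pi}{1-\pi},
% and log-normalizer a(\eta) = \log(1 + e^{\eta}).
p(x) = \exp\left\{ \eta x - \log\bigl(1 + e^{\eta}\bigr) \right\}, \qquad x \in \{0, 1\}.
% The gradient of the log-normalizer gives the expected sufficient statistic,
% which here is the sigmoid of the natural parameter:
\mathbb{E}[T(x)] = \nabla_{\eta} a(\eta) = \frac{e^{\eta}}{1 + e^{\eta}} = \sigma(\eta).
```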

Deep exponential families. To construct deep exponential families, we chain exponential families together in a hierarchy, where the draw from one layer controls the natural parameters of the next.

For each data point x_n, the model has L layers of hidden variables {z_{n,1}, ..., z_{n,L}}, where each z_{n,ℓ} = {z_{n,ℓ,1}, ..., z_{n,ℓ,K_ℓ}}. We assume that z_{n,ℓ,k} is a scalar, but the model generalizes beyond this. Shared across data, the model has L − 1 layers of weights {W_1, ..., W_{L−1}}, where each W_ℓ is a collection of K_ℓ vectors, each one with dimension K_{ℓ+1}: W_ℓ = {w_{ℓ,1}, ..., w_{ℓ,K_ℓ}}. We assume the weights have a prior distribution p(W_ℓ).

For simplicity, we omit the data index n and describe the distribution of a single data point x. First, the top layer of latent variables is drawn given a hyperparameter η,

p(z_{L,k}) = expfam_L(z_{L,k}, η),

where the notation expfam(x, η) denotes that x is drawn from an exponential family with natural parameter η.¹

¹We are loose with the base measure h, as it can be absorbed into the dominating measure.

Next, each latent variable is drawn conditional on the previous layer,

p(z_{ℓ,k} | z_{ℓ+1}, w_{ℓ,k}) = expfam_ℓ(z_{ℓ,k}, g_ℓ(z_{ℓ+1}^T w_{ℓ,k})).   (1)

The function g_ℓ maps the inner product to the natural parameter. Similar to the literature on generalized linear models [26], we call it the link function. Figure 2 depicts this conditional structure in a graphical model.

Confirm that the dimensions work: z_{ℓ,k} is a scalar; z_{ℓ+1} is a K_{ℓ+1} vector and w_{ℓ,k} is a column from a K_{ℓ+1} × K_ℓ matrix. Note that each of the K_ℓ variables in layer ℓ depends on all the variables of the higher layer. This gives the model the flavor of a neural network. The subscript ℓ on expfam indicates that the type of exponential family can change across layers. This hierarchy of latent variables defines the DEF.
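To make the dimension bookkeeping concrete, here is a minimal numpy sketch (illustrative only; the layer sizes and the identity link are assumptions for the example) that forms the K_ℓ natural parameters of one layer from the layer above and its weights:

```python
import numpy as np

# Hypothetical sizes: K_{l+1} = 4 variables in the layer above, K_l = 3 variables below.
K_above, K_below = 4, 3

rng = np.random.default_rng(0)
z_above = rng.gamma(shape=0.3, scale=1.0, size=K_above)         # z_{l+1}, a K_{l+1}-vector
W = rng.gamma(shape=0.1, scale=1.0, size=(K_above, K_below))    # columns are w_{l,k}; a K_{l+1} x K_l matrix

# Each z_{l,k} receives the natural parameter g_l(z_{l+1}^T w_{l,k}); with the identity
# link this is just the inner product, computed for all k at once by a matrix product.
natural_params = z_above @ W
assert natural_params.shape == (K_below,)   # one scalar natural parameter per variable in layer l
```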


[Figure 1 shows the topic hierarchy. Third-layer concepts include "Government" and "Politics"; second-layer super topics under them include "Judicial," "Legislative," "Executive," "Political Parties," and "Media and Religion," each grouping related word-level topics.]

Figure 1: A fraction of the three layer topic hierarchy on 166K The New York Times articles. The top words are shown for each topic. The arrows represent hierarchical groupings.

[Figure 2 is a graphical model: plates over the N data points and the layer sizes K_1, ..., K_L, with weights w_{0,i}, ..., w_{L−1,k} connecting the layers z_{n,L,k} down to z_{n,1,k} and the observations x_{n,i}.]

Figure 2: The deep exponential family with V observations.

DEFs can also be understood as random effects models [11] where the variables are controlled by the product of a weight vector and a set of latent covariates.

Likelihood. The data are drawn conditioned on the lowest layer of the DEF, p(x_{n,i} | z_{n,1}). Separating the likelihood from the DEF will allow us to compose and embed DEFs in other models. Later, we provide an example where we combine two DEFs to form a model for pairwise data.

In this paper we focus on count data, thus we use the Poisson distribution as the observation likelihood. The Poisson distribution with mean λ is

p(x_{n,i} = x) = (x!)^{-1} e^{−λ} λ^x.

If we let x_{n,i} be the count of type i associated with observation n, then x_{n,i}'s distribution is

p(x_{n,i} | z_{n,1}, W_0) = Poisson(z_{n,1}^T w_{0,i}).

The observation weights W_0 form a matrix where each entry is gamma distributed. We will discuss the gamma distribution further in the next section.

Returning to the example from the introduction of modeling documents, the x_n are a vector of term counts. The observation weights W_0 put positive mass on groups of terms and thus form "topics." Similarly, the weights on the second layer represent "super topics," and the weights on the third layer represent "concepts." The distribution p(z_{n,1} | z_{n,2}, W_1) represents the distribution of "topics" given the "super topics" of a document. Figure 1 depicts the compositional and sharing semantics of DEFs.

The link function. Here we explore some of the connections between neural networks and deep exponential families. As we discussed, the latent variable layers in deep exponential families are connected together via a link function, g_ℓ. This link function specifies the natural parameters for z_{ℓ,k} from z_{ℓ+1}^T w_{ℓ,k}.

Using properties of exponential families, we can determine how the link function alters the distribution of the ℓth layer. The moments of the sufficient statistics of an exponential family are given by the gradient of the log-normalizer, ∇_η a(η). These moments completely specify the exponential family [7]. Thus in DEFs, the mean of the next layer is controlled by the link function g_ℓ via the gradient of the log-normalizer,

E[T(z_{ℓ,k})] = ∇_η a(g_ℓ(z_{ℓ+1}^T w_{ℓ,k})).   (2)


Consider the case of the identity link function, where g_ℓ(x) = x. In this case, the expectation of z_ℓ in deep exponential families is a linear function of the weights and previous layer, transformed by the gradient of the log-normalizer. This transformation of the expectation is one source of non-linearity in DEFs. It parallels the non-linearities used in neural networks.

To be clear, here is a representation of how the values and weights of one layer control the expected sufficient statistics at the next:

(z_{n,ℓ+1}, w_{ℓ,k}) → g_ℓ → ∇a → E[T(z_{n,ℓ,k})]

For example, in the sigmoid belief network [25], we will see that the identity link function recovers the sigmoid transformation used in neural networks.

2.1 Examples

To illustrate the potential of deep exponential families, we present three examples: the sparse gamma DEF, the sigmoid belief network, and a latent Poisson DEF.

Sparse gamma DEF. The sparse gamma DEF is a DEF with gamma distributed layers. The gamma distribution is an exponential family distribution with support over the positive reals. The probability density of the gamma distribution with natural parameters α and β is

p(z) = z^{-1} exp(α log z − βz − log Γ(α) + α log β),

where Γ is the gamma function. The expectation of the gamma distribution is E[z] = αβ^{-1}.

Through the link function in DEFs, the inner product of the previous layer and the weights controls the natural parameters of the next layer. For sparse gamma models, we let the inner product control the expected activation of the next layer, while the shape at each layer remains fixed at α_ℓ. From the expectation of the gamma distribution, these conditions mean the link function can be given as

g_α = α_ℓ,    g_β = α_ℓ / (z_{ℓ+1}^T w_{ℓ,k}).

As the expectation of gamma variables needs to be positive, we let the weight matrices be gamma distributed as well.

We set the shape parameters for the weights and layers to be less than 1. When the shape is less than 1, gamma distributions put most of their mass near zero; we call this type of distribution sparse gamma. This type of distribution is akin to a soft spike-and-slab prior [16]. Spike-and-slab priors have been shown to perform well on feature selection and unsupervised feature discovery [12, 14].

The sparse gamma distribution differs from distributions such as the normal and Poisson in how the probability mass moves given a change in the mean. For example, when the expected value is high, draws from the Poisson distribution are likely to be much larger than zero, while draws from the sparse gamma distribution will either be close to zero or very large. This is like growing the slab of our soft spike-and-slab prior. We demonstrate this visually in the appendix.

We estimate the posterior on this DEF using one to three layers for two large text corpora: Science and The New York Times (NYT). We defer the discussion of the details of the corpora to Section 5. The topic hierarchy shown earlier in Figure 1 is from a three layer sparse gamma DEF. In the appendix we present a portion of the Science hierarchy.
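The generative process just described is straightforward to simulate. Below is a minimal numpy sketch of a two-layer sparse gamma DEF with a Poisson observation layer; the layer sizes, shapes, and rates are illustrative placeholders rather than the settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
K2, K1, V = 5, 10, 50        # hypothetical layer sizes and vocabulary size
alpha = 0.3                  # fixed gamma shape at each layer (assumed value)

# Gamma-distributed weights; numpy parameterizes the gamma by shape and scale = 1/rate.
W1 = rng.gamma(shape=0.1, scale=1.0 / 0.3, size=(K2, K1))   # super topics -> topics
W0 = rng.gamma(shape=0.1, scale=1.0 / 0.3, size=(K1, V))    # topics -> terms

# Top layer: sparse gamma draws given a fixed hyperparameter.
z2 = rng.gamma(shape=alpha, scale=1.0 / 0.3, size=K2)

# Next layer: gamma with fixed shape alpha and mean z2^T w_{1,k},
# so the rate is alpha / (z2^T w_{1,k}) and the scale is (z2^T w_{1,k}) / alpha.
z1 = rng.gamma(shape=alpha, scale=(z2 @ W1) / alpha)

# Observations: Poisson counts with rate z1^T w_{0,i} for each term i.
x = rng.poisson(z1 @ W0)
print(x.sum(), "total counts over", V, "terms")
```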

Sigmoid belief network. The sigmoid belief network [23, 25] is a widely used deep latent variable model. It consists of latent Bernoulli layers where the mean of a feature at layer ℓ is given by a linear combination of the features at layer ℓ + 1 with the weights, passed through the sigmoid function.

This is a special case of a deep exponential family with Bernoulli latent layers and the identity link function. To see this, use Eq. 1 to form the Bernoulli conditional of a hidden layer,

p(z_{ℓ,k} | z_{ℓ+1}, w_{ℓ,k}) = exp(z_{ℓ+1}^T w_{ℓ,k} z_{ℓ,k} − log(1 + exp(z_{ℓ+1}^T w_{ℓ,k}))),

where z_{ℓ,k} ∈ {0, 1}. Using Eq. 2, the expectation of z_{ℓ,k} is the derivative of the log-normalizer of the Bernoulli. This derivative is the logistic function. Thus, a DEF with Bernoulli conditionals and identity link function recovers the sigmoid belief network. The weights are real valued, so we set p(W_ℓ) to be a factorized normal distribution.

In the sigmoid belief network, we allow for the natural parameters to have intercepts. The intercepts provide a baseline activation for each feature independent of the shared weights.
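As a small illustration (a sketch, not the authors' code), sampling one unit of a sigmoid belief network layer amounts to a sigmoid of the inner product plus an intercept:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
z_above = rng.integers(0, 2, size=8).astype(float)   # binary draws from layer l+1 (hypothetical size)
w = rng.normal(0.0, 1.0, size=8)                     # real-valued weights w_{l,k} with a normal prior
b = 0.0                                              # intercept giving a baseline activation

p_on = sigmoid(z_above @ w + b)     # E[z_{l,k}] = sigma(z_{l+1}^T w_{l,k} + b)
z_k = rng.binomial(1, p_on)         # draw the binary unit
```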

Poisson DEF. The Poisson distribution is a distribution over counts that models the number of events occurring in an interval. Its exponential family form with parameter η is

p(z) = (z!)^{-1} exp(ηz − exp(η)),

where the mean of this distribution is e^η.


We define the Poisson deep exponential family as a DEF with latent Poisson layers and the log link function.² This corresponds to the following conditional distribution for the layers:

p(z_{ℓ,k} | z_{ℓ+1}, w_{ℓ,k}) = (z_{ℓ,k}!)^{-1} exp(log(z_{ℓ+1}^T w_{ℓ,k}) z_{ℓ,k} − z_{ℓ+1}^T w_{ℓ,k}).

²We add an intercept to ensure positivity of the rate.

In the case of document modeling, the value of z_{2,k} represents how many times "super topic" k is represented in this example.

Using the link function property of DEFs described earlier, the mean activation at the next layer is given by the gradient of the log-normalizer,

E[z_{ℓ,k}] = ∇_η a(log(z_{ℓ+1}^T w_{ℓ,k})).

For the Poisson, a is the exponential function, thus its derivative is also the exponential function. This means the mean of the next layer is equal to the inner product of the previous layer and the weights. Our choice of link function requires positivity of the weights, so we let p(W_ℓ) be a factorized gamma distribution.

We also consider Poisson DEFs with real-valued weights, to allow negative relations between lower layers and higher layers. We set the prior on the weights to be Gaussian in this case. We use the log-softmax link function, η = log(log(1 + exp(z_{ℓ+1}^T w_{ℓ,k}))), where the function inside the first log is the softmax function. It preserves an approximately linear relation between the mean activation and the inner product z_{ℓ+1}^T w_{ℓ,k} when the inner product is large, while also allowing the inner product to take negative values.

Similar to the sigmoid belief network, we allow the natural parameter to have an intercept. This type of Poisson DEF can be seen as an extension of the sigmoid belief network, where each observation expresses an integer count of a feature rather than a binary feature. Table 1 summarizes the DEFs we have described and will study in our experiments.
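A quick numeric check (illustrative only) of why the log-softmax link behaves as described: the implied mean activation, exp(η) = log(1 + exp(z^T w)), tracks the inner product when it is large and stays small but positive when it is negative.

```python
import numpy as np

inner_products = np.array([-3.0, 0.0, 3.0, 10.0])        # hypothetical values of z_{l+1}^T w_{l,k}
mean_activation = np.log1p(np.exp(inner_products))       # exp(eta) under the log-softmax link
for ip, m in zip(inner_products, mean_activation):
    print(f"inner product {ip:6.1f} -> mean activation {m:7.4f}")
```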

3 Related Work

Graphical models and neural nets have a long and distinguished history. A full review is outside the scope of this article; however, we highlight some key results as they relate to DEFs. More generally, deep exponential families fall into the broad class of stochastic feed forward belief networks [25], but Neal [25] focuses mainly on one example in this class, the sigmoid belief network, which is a binary latent variable model. Several existing stochastic feed forward networks are DEFs, such as latent Gaussian models [28] and the sigmoid belief network with layerwise dependencies [23].

Undirected graphical models have also been used in inferring compositional hierarchies. Salakhutdinov and Hinton [30] propose deep probabilistic models based on Restricted Boltzmann Machines (RBMs) [36]. RBMs are a two layer undirected probabilistic model with one layer of latent variables and one layer of observations tied together by a weight matrix. Directed models such as DEFs have the property of explaining away, where latent variables that are independent under the prior become dependent conditioned on the observations. This property makes inference harder than in RBMs, but forces a more parsimonious representation where similar features compete to explain the data rather than work in tandem [12, 1].

RBMs have been extended to general exponential family conditionals in a model called exponential family harmoniums (EFH) [39]. A certain infinite DEF with tied weights is equivalent to an EFH [15], but as our weights are not tied, deep exponential families represent a broader class of models than exponential family harmoniums (and RBMs).

The literature on latent variable models relates to DEFs through hierarchical models and Bayesian factor analysis. Latent tree hierarchies have been constructed with specific distributions (Dirichlet) [21], while Bayesian factor analysis methods such as exponential family PCA [24] and multinomial PCA [8] can be seen as single layer deep exponential families.

4 Inference

The central computational problem for working with DEFs is posterior inference. The intractability of the partition function means posterior computations require approximations. Past work on sigmoid belief networks has proposed doing greedy layer-wise learning (for a specific kind of network) [15]. Here, instead, we develop variational methods [35] that are applicable to general DEFs.

Variational inference [18] casts the posterior inference problem as an optimization problem. Variational algorithms seek to minimize the KL divergence to the posterior from an approximating distribution q. This is equivalent to maximizing the following [2]:

L(q) = E_{q(z,W)}[log p(x, z, W) − log q(z, W)],

where z denotes all latent variables associated with the observations and W all latent variables shared across observations. This objective function is called the Evidence Lower BOund (ELBO) because it is a lower bound on log p(x).
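A minimal Monte Carlo sketch of this objective (with placeholder log-density functions standing in for the model and the approximation; not the paper's implementation):

```python
import numpy as np

def elbo_estimate(sample_q, log_p, log_q, num_samples=1000, rng=None):
    """Monte Carlo estimate of E_q[log p(x, z, W) - log q(z, W)]."""
    rng = rng or np.random.default_rng(3)
    values = [log_p(z) - log_q(z) for z in (sample_q(rng) for _ in range(num_samples))]
    return np.mean(values)

# Toy usage: q = N(1, 1) approximating p = N(0, 1) for a single latent variable.
estimate = elbo_estimate(
    sample_q=lambda rng: rng.normal(1.0, 1.0),
    log_p=lambda z: -0.5 * (z ** 2 + np.log(2 * np.pi)),
    log_q=lambda z: -0.5 * ((z - 1.0) ** 2 + np.log(2 * np.pi)),
)
print(estimate)   # should be near -0.5, the exact negative KL for this toy pair
```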


z-Dist    | z_{ℓ+1}          | W-dist | w_{ℓ,k}          | g_ℓ                 | E[T(z_{ℓ,k})]
Gamma     | R_+^{K_{ℓ+1}}    | Gamma  | R_+^{K_{ℓ+1}}    | [constant, inverse] | [z_{ℓ+1}^T w_{ℓ,k}, ψ(α_ℓ) − log α_ℓ + log(z_{ℓ+1}^T w_{ℓ,k})]
Bernoulli | {0,1}^{K_{ℓ+1}}  | Normal | R^{K_{ℓ+1}}      | identity            | σ(z_{ℓ+1}^T w_{ℓ,k})
Poisson   | N^{K_{ℓ+1}}      | Gamma  | R_+^{K_{ℓ+1}}    | log                 | z_{ℓ+1}^T w_{ℓ,k}
Poisson   | N^{K_{ℓ+1}}      | Normal | R^{K_{ℓ+1}}      | log-softmax         | log(1 + exp(z_{ℓ+1}^T w_{ℓ,k}))

Table 1: A summary of all the DEFs we present in terms of their layer distributions, weight distributions, and link functions.

For the approximating distribution q, we use the mean field variational family. In the mean field approximating family, the distribution over the latent variables factorizes. Let N be the number of observations; then the variational family is

q(z, W) = q(W_0) ∏_{ℓ=1}^{L} [ q(W_ℓ) ∏_{n=1}^{N} q(z_{n,ℓ}) ],

where q(z_{n,ℓ}) and q(W_ℓ) are fully factorized. Each component in q(z_{n,ℓ}) is

q(z_{n,ℓ,k}) = expfam_ℓ(z_{n,ℓ,k}, λ_{n,ℓ,k}),

where the exponential family is the same one as the model distribution p. Similarly, we choose q(W) to be in the same family as p(W), with parameters ξ.

To maximize the ELBO, we need to compute expectations under the approximation q. These expectations for general DEFs will not have a simple analytic form. Thus we use more recent "black box" variational inference techniques that step around computing this expectation [40, 34, 27].

Black box variational inference methods use stochastic optimization [29] to avoid the analytic intractability of computing the objective function. Stochastic optimization works by following noisy unbiased gradients. In black box variational inference [27], the gradient of the ELBO with respect to the parameters of a latent variable can be written as an expectation with respect to the variational approximation.

More formally, let p_{n,ℓ,k}(x, z, W) be the terms in the log-joint that contain z_{n,ℓ,k} (its Markov blanket); then the gradient for the approximation of z_{n,ℓ,k} is

∇_{λ_{n,ℓ,k}} L = E_q[∇_{λ_{n,ℓ,k}} log q(z_{n,ℓ,k}) (log p_{n,ℓ,k}(x, z, W) − log q(z_{n,ℓ,k}))].

We compute Monte Carlo estimates of this gradient by averaging the evaluation of the gradient at several samples. To compute the Monte Carlo estimate of the gradient, we need to be able to sample from the approximation to evaluate the Markov blanket for each latent variable, the approximating distribution, and the gradient of the log of the approximating distribution (the score function). We detail the score functions in the appendix. From this equation, we can see that the primary cost in computing the gradients is in evaluating the likelihood and score function on a sample. To speed up our algorithm, we parallelize the likelihood computation across samples.
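Here is a minimal sketch of this score-function estimator for a single latent variable with a Poisson variational approximation (its score, ∂ log q(z)/∂λ = z/λ − 1, is derived in the appendix); the Markov-blanket log-joint below is a toy placeholder, not a DEF.

```python
import numpy as np
from scipy.stats import poisson

def bbvi_gradient(lam, log_p_blanket, num_samples=100, rng=None):
    """Score-function estimate of d/d(lambda) E_q[log p - log q] for q(z) = Poisson(lambda)."""
    rng = rng or np.random.default_rng(4)
    grads = []
    for _ in range(num_samples):
        z = rng.poisson(lam)
        score = z / lam - 1.0                                   # gradient of log q(z; lambda)
        grads.append(score * (log_p_blanket(z) - poisson.logpmf(z, lam)))
    return np.mean(grads)                                       # noisy, unbiased gradient estimate

# Toy usage: the "Markov blanket" is just a Poisson(3) prior on z,
# so the true gradient pushes lambda from 2 toward 3.
print(bbvi_gradient(lam=2.0, log_p_blanket=lambda z: poisson.logpmf(z, 3.0)))
```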

The Markov blanket for a latent variable in the first layer of a DEF is

log p_{n,1,k}(x, z, W) = log p(z_{n,1,k} | z_{n,2}, w_{1,k}) + log p(x_n | z_{n,1}, W_0).   (3)

The Markov blanket for a latent variable in an intermediate layer is

log p_{n,ℓ,k}(x, z, W) = log p(z_{n,ℓ,k} | z_{n,ℓ+1}, w_{ℓ,k}) + log p(z_{n,ℓ−1} | z_{n,ℓ}, W_{ℓ−1}).   (4)

The Markov blanket for the top layer is

log p_{n,L,k}(x, z, W) = log p(z_{n,L,k}) + log p(z_{n,L−1} | z_{n,L}, W_{L−1}).   (5)

The gradients for W can be written similarly.

Stochastic optimization requires a learning rate to scale the noisy gradients before applying them. We use RMSProp, which scales the gradient by the square root of the online average of the squared gradient.³ RMSProp captures the varying length scales and noise through the sum of squares term used to normalize the gradient step. We present a sketch of the algorithm in Alg. 1, and present the full algorithm in the appendix.

³www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

5 Experiments

We have introduced DEFs and detailed a procedure for posterior inference in DEFs. We now provide an extensive evaluation of DEFs. We report predictive results from 28 different DEF instances where we explore the number of layers (1, 2 or 3), the latent variable distributions (gamma, Poisson, Bernoulli) and the weight distributions (normal, gamma) using a Poisson observational model. Furthermore, we instantiate and report results using a combination of two DEFs for pairwise data.



Algorithm 1 BBVI for DEFs

Input: data X, model p, L layers.
Initialize λ, ξ randomly; t = 1.
repeat
    Sample a data point x
    for s = 1 to S do
        z_x[s], W[s] ~ q
        p[s] = log p(z_x[s], W[s], x)
        q[s] = log q(z_x[s], W[s])
        g[s] = ∇ log q(z_x[s], W[s])
    end for
    Compute the gradient using BBVI
    Update the variational parameters for z and W
until the change in validation likelihood is small

Our results:

• Show improvements over strong baselines for both topic modeling and collaborative filtering on a total of four corpora.

• Lead us to conclude that deeper DEFs and sparse gamma DEFs display the strongest performance overall.

5.1 Text Modeling

We consider two large text corpora: Science and The New York Times. Science consists of 133K documents and 5.9K terms. The New York Times consists of 166K documents and 8K terms.

Baselines. As baselines we consider Latent Dirichlet Allocation (LDA) [6], a popular topic model, and the state-of-the-art DocNADE [19]. DocNADE estimates the probability of a given word in a document given the previously observed words in that document. In DocNADE, the connections between each observation and the latent variables used to generate the observations are shared.

We note that the one layer sparse gamma DEF is equivalent to Poisson matrix factorization [9, 13], but our model is fully Bayesian and our variational distribution is collapsed.

Evaluation. We compute perplexity on a held out set of 1,000 documents. Held out perplexity is given by

exp( − Σ_{d ∈ docs} Σ_{w ∈ d} log p(w | # held out in d) / N_{held out words} ).

Conditional on the total number of held out words, the distribution of the held out words becomes multinomial.
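For concreteness, this is just the exponential of the negative average per-word log probability over the held-out words; a small sketch with hypothetical per-word log probabilities:

```python
import math

def heldout_perplexity(log_probs_per_doc):
    """log_probs_per_doc: one list per document of log p(w | # held out in d) for each held-out word."""
    total_log_prob = sum(sum(doc) for doc in log_probs_per_doc)
    num_heldout_words = sum(len(doc) for doc in log_probs_per_doc)
    return math.exp(-total_log_prob / num_heldout_words)

# Toy usage with two documents and made-up per-word log probabilities.
print(heldout_perplexity([[-7.2, -6.9, -8.1], [-7.7, -6.5]]))
```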

Model                      | W | NYT  | Science
LDA [6]                    |   | 2717 | 1711
DocNADE [19]               |   | 2496 | 1725
Sparse Gamma 100           | ∅ | 2525 | 1652
Sparse Gamma 100-30        | Γ | 2303 | 1539
Sparse Gamma 100-30-15     | Γ | 2251 | 1542
Sigmoid 100                | ∅ | 2343 | 1633
Sigmoid 100-30             | N | 2653 | 1665
Sigmoid 100-30-15          | N | 2507 | 1653
Poisson 100                | ∅ | 2590 | 1620
Poisson 100-30             | N | 2423 | 1560
Poisson 100-30-15          | N | 2416 | 1576
Poisson log-link 100-30    | Γ | 2288 | 1523
Poisson log-link 100-30-15 | Γ | 2366 | 1545

Table 2: Perplexity on a held out collection of 1K Science and NYT documents. Lower values are better. The W column indicates the type of prior distribution over the DEF weights: Γ for the gamma prior and N for the normal (recall that one layer DEFs consist only of a layer of latent variables, so we represent their prior with ∅).

The mean of the conditional multinomial is given by the normalized Poisson rate in each document. We set the rates to the expected value under the variational distribution. Additionally, we let all methods see ten percent of the words in each document; the other ninety percent form the held out set. This is similar to the document completion evaluation metric [38], except we query the test words independently. We use the observed ten percent to compute the variational distribution for the document specific latent variables (the DEF for the document), while keeping the approximation on the shared weights fixed. In DocNADE, this corresponds to always seeing a fixed set of words first, then evaluating each new word given the first ten percent of the document.

Held out perplexity differs from perplexity computed from the predictive distribution p(x* | x). The former can be a more difficult problem as we only ever condition on a fraction of the document. Additionally, computing perplexity from the predictive distribution requires computationally demanding sampling procedures, which for most models like LDA allow testing of only a small number (50) of documents [38, 33]. In contrast, our held-out test metric can be quickly computed for 1,000 test documents.

Architectures and hyperparameters. We build one, two and three layer hierarchies of the sparse gamma DEF, sigmoid belief network, Poisson DEF, and log-link Poisson DEF. The sizes of the layers are 100, 30, and 15, respectively. We note that while different DEFs may have better predictive performance


Model              | Netflix Perplexity | Netflix NDCG | ArXiv Perplexity | ArXiv NDCG
Gaussian MF [32]   | –                  | 0.008        | –                | 0.013
1 layer Double DEF | 2319               | 0.031        | 2138             | 0.049
2 layer Double DEF | 2299               | 0.022        | 1893             | 0.050
3 layer Double DEF | 2296               | 0.037        | 1940             | 0.053

Table 3: A comparison of matrix factorization methods on Netflix and the ArXiv. We find that the Double DEFs outperform the shallow ones on perplexity. We also report the NDCG for around 100 low-activity users (users with fewer than 5 and 10 observations in the observed 10% of the held-out set, respectively, for Netflix and ArXiv). We use Vowpal Wabbit's MF implementation, which does not readily provide held-out likelihoods; thus we do not report the perplexity associated with MF.

at different sizes, we consider DEFs of a fixed size as we also seek a compact, explorable representation of our corpus. One hundred topics fall into the range of topics searched in the topic modeling literature [4]. We detail the hyperparameters in the appendix.

We observe two phases to DEF convergence; it converges quickly to a good held-out perplexity (around 2,000 iterations) and then slowly improves until final convergence (around 10,000 iterations). Each iteration takes approximately 15 seconds on a modern 32-core machine (from Amazon's AWS).

Results. Table 2 summarizes the predictive results on both corpora. We note that DEFs outperform the baselines on both datasets. Furthermore, moving beyond one layer models generally improves performance, as expected. The table also reveals that stacking layers of gamma latent variables always leads to similar or better performance. Finally, as shown by the Poisson DEFs with different link functions, we find gamma-distributed weights to outperform normally-distributed weights. Somewhat related, we find sigmoid DEFs (with normal weights) to be more difficult to train, and deeper versions perform poorly.

5.2 Matrix Factorization

Previously, we constructed models out of a single DEF, but DEFs can be embedded and composed in more complex models. We now present the double DEF, a factorization model for pairwise data where both the rows and columns are determined by DEFs. The graphical model of the double DEF corresponds to replacing W_0 in Figure 2 with another DEF.

We focus on factorization of counts (ratings, clicks). The observed data are generated with a Poisson. The observation likelihood for this double DEF is

p(x_{n,i} | z^c_{n,1}, z^r_{i,1}) = Poisson((z^c_{n,1})^T z^r_{i,1}),

where z^c_{n,1} is the lowest layer of a DEF for the nth observation and z^r_{i,1} is the lowest layer of a DEF for the ith item. The double DEF has hierarchies on both users and items.
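A minimal sketch of this observation layer (sizes and priors are illustrative placeholders): the rate for a user/item pair is the inner product of the two bottom-layer representations.

```python
import numpy as np

rng = np.random.default_rng(5)
num_users, num_items, K1 = 6, 8, 10                       # hypothetical sizes
Z_users = rng.gamma(0.3, 1.0, size=(num_users, K1))        # bottom layer of the user (row) DEF
Z_items = rng.gamma(0.3, 1.0, size=(num_items, K1))        # bottom layer of the item (column) DEF

rates = Z_users @ Z_items.T          # rate of x_{n,i} is the inner product z^c_{n,1} . z^r_{i,1}
X = rng.poisson(rates)               # simulated user-item count matrix
print(X.shape, X.sum())
```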

We infer double DEFs on Netflix movie ratings and click data from the ArXiv (www.arXiv.org), which indicates how many times a user clicked on a paper. Our Netflix collection consists of 50K users and 17.7K movies. The movie ratings range from zero to five stars, where zero means the movie was unrated by the user. The ArXiv collection consists of 18K users and 20K documents. We fit a one, two, and three layer double DEF where the sizes of the row DEF match the sizes of the column DEF at each layer. The sizes of the layers are 100, 30, and 15. We compare double DEFs to l2-regularized (Gaussian) matrix factorization (MF) [32]. We re-use the testing procedure introduced in the previous section (this is referred to as strong generalization in the recommendation literature [22]), where the held-out test set contains one thousand users. For performance and computational reasons we subsample zero-observations for MF, as is standard [13]. Further, we also report the commonly-used multi-level ranking measure (untruncated) NDCG [17] for all methods.

Table 3 shows that two-layer DEFs improve performance over the shallow DEF and that all DEFs outperform Gaussian MF. On perplexity, the three layer model performs similarly on Netflix and slightly worse on the ArXiv. The table further highlights that, when comparing ranking performance, the advantage of deeper models is especially clear on low-activity users (NDCG across all test users is comparable within the three DEF architectures and is not reported here). This data regime is of particular importance for practical recommender systems. We postulate that this is due to the hierarchy in deeper models acting as a more structured prior compared to single-layer models.

6 Discussion

We develop deep exponential families as a way to describe hierarchical relationships of latent variables to capture compositional semantics of data. We present several instantiations of deep exponential families and achieve improved predictive power and interpretable semantic structures for problems in both text modeling and collaborative filtering.


Acknowledgements. We thank Prem Gopalan, Matt Hoffman, and the reviewers for their helpful comments. RR is supported by an NDSEG fellowship. DMB is supported by NSF IIS-0745520, NSF IIS-1247664, ONR N00014-11-1-0651, and DARPA FA8750-14-2-0009.

References

[1] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, 35(8), 2013.

[2] C. Bishop. Pattern Recognition and Machine Learning. Springer New York, 2006.

[3] D. Blei and J. Lafferty. Dynamic topic models. In International Conference on Machine Learning, 2006.

[4] D. Blei and J. Lafferty. A correlated topic model of Science. Annals of Applied Statistics, 1(1):17–35, 2007.

[5] D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In NIPS, 2003.

[6] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.

[7] L. Brown. Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, 1986.

[8] W. Buntine and A. Jakulin. Applying discrete PCA in data analysis. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 59–66. AUAI Press, 2004.

[9] J. Canny. GaP: A factor model for discrete data. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004.

[10] A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge Univ. Press, 2007.

[11] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis. CRC Press, 2013.

[12] I. Goodfellow, A. Courville, and Y. Bengio. Large-scale feature learning with spike-and-slab sparse coding. In International Conference on Machine Learning (ICML), 2012.

[13] P. Gopalan, J. Hofman, and D. Blei. Scalable recommendation with Poisson factorization. arXiv, (1311.1704), 2013.

[14] Daniel Hernandez-Lobato, Jose Miguel Hernandez-Lobato, and Pierre Dupont. Generalized spike-and-slab priors for Bayesian group feature selection using expectation propagation. The Journal of Machine Learning Research, 14(1):1891–1945, 2013.

[15] G. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, July 2006. ISSN 0899-7667.

[16] H. Ishwaran and S. Rao. Spike and slab variable selection: Frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730–773, 2005.

[17] Kalervo Jarvelin and Jaana Kekalainen. IR evaluation methods for retrieving highly relevant documents. In International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '00, pages 41–48. ACM, 2000. ISBN 1-58113-226-3.

[18] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. Introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.

[19] Hugo Larochelle and Stanislas Lauly. A neural autoregressive topic model. In Neural Information Processing Systems, 2012.

[20] D. Lee and H. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, October 1999.

[21] W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In ICML, 2006.

[22] Benjamin Marlin. Collaborative filtering: A machine learning perspective. Technical report, University of Toronto, 2004.

[23] A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In ICML, 2014.

[24] S. Mohamed, K. Heller, and Z. Ghahramani. Bayesian exponential family PCA. In NIPS, 2008.

[25] R. Neal. Learning stochastic feedforward networks. Tech. Rep. CRG-TR-90-7, Department of Computer Science, University of Toronto, 1990.

[26] J. A. Nelder and R. W. M. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society, Series A (General), 135:370–384, 1972.

[27] R. Ranganath, S. Gerrish, and D. Blei. Black box variational inference. In International Conference on Artificial Intelligence and Statistics, 2014.

[28] D. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. ArXiv e-prints, January 2014.

[29] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

[30] R. Salakhutdinov and G. Hinton. Deep Boltzmann machines. In AISTATS, pages 448–455, 2009.

[31] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In International Conference on Machine Learning, 2008.

[32] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Neural Information Processing Systems, 2008.

[33] Ruslan Salakhutdinov and Geoffrey Hinton. Replicated softmax: an undirected topic model. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1607–1614. 2009.

[34] T. Salimans and D. Knowles. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.

[35] L. Saul, T. Jaakkola, and M. Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61–76, 1996.


[36] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1, 1986.

[37] Nitish Srivastava, Ruslan Salakhutdinov, and Geoffrey Hinton. Modeling documents with a deep Boltzmann machine. In UAI, 2013.

[38] H. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In International Conference on Machine Learning (ICML), 2009.

[39] M. Welling, M. Rosen-Zvi, and G. Hinton. Exponential family harmoniums with an application to information retrieval. In Neural Information Processing Systems (NIPS) 17, 2004.

[40] D. Wingate and T. Weber. Automated variational inference in probabilistic programming. ArXiv e-prints, January 2013.


Deep Exponential Families (Appendix)

Rajesh Ranganath  Linpeng Tang  Laurent Charlin  David M. Blei
Princeton University  Princeton University  Columbia University  Columbia University

{rajeshr,linpengt}@cs.princeton.edu  {lcharlin,blei}@cs.columbia.edu

Appendix

Gamma Distribution. Figure 1 visually demonstrates how the sparse gamma distribution and the Poisson distribution change when moving from a low to a high mean. We plot both the Poisson and the sparse gamma distribution in both settings. The mode of the Poisson distribution moves, while the slab grows for the gamma.

General Algorithm. Following the notation from the main paper, the general algorithm for mean field variational inference in deep exponential families is given in Alg. 1.

Properties of q. In our experiments, we use four variational families (Poisson, gamma, Bernoulli, and normal). We detail the necessary score functions here. For the Poisson, the distribution is given by

q(z) = (z!)^{-1} e^{−λ} λ^z.

The score function is

∂ log q(z) / ∂λ = −1 + z/λ.

For the gamma, we use the shape α and scale θ as variational parameters. The distribution is given by

q(z) = (1 / (Γ(α) θ^α)) z^{α−1} e^{−z/θ}.

The score functions are

∂ log q(z) / ∂α = −ψ(α) − log θ + log z,
∂ log q(z) / ∂θ = −α/θ + z/θ²,

where ψ is the digamma function.
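These gamma score functions can be checked numerically; a small sketch (illustrative values) comparing them against central finite differences of scipy's gamma log-density in the same shape/scale parameterization:

```python
import numpy as np
from scipy.special import digamma
from scipy.stats import gamma

def gamma_scores(z, shape, scale):
    """Score functions of the gamma log-density with respect to shape and scale."""
    d_shape = -digamma(shape) - np.log(scale) + np.log(z)
    d_scale = -shape / scale + z / scale ** 2
    return d_shape, d_scale

z, shape, scale, eps = 1.7, 0.3, 2.0, 1e-5
analytic = gamma_scores(z, shape, scale)
numeric = (
    (gamma.logpdf(z, shape + eps, scale=scale) - gamma.logpdf(z, shape - eps, scale=scale)) / (2 * eps),
    (gamma.logpdf(z, shape, scale=scale + eps) - gamma.logpdf(z, shape, scale=scale - eps)) / (2 * eps),
)
print(np.allclose(analytic, numeric, atol=1e-6))   # expected to print True
```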

[Appendix Figure 1 shows two panels of draws, one with mean 0.5 and one with mean 20.]

Figure 1: Draws from the Poisson (blue) and sparse gamma distribution (orange) with low (mean = 0.5) and high (mean = 20) mean. The shape of the sparse gamma is held fixed. Note that the high mean shifts the Poisson, while it does not shift the sparse gamma. Notice the spike-slab appearance of the sparse gamma distribution.


For the Bernoulli distribution, we use the natural parameterization with parameter η to form the variational approximation. The distribution is

q(z) = 1 / (1 + e^{−(2z−1)η}).

The score function is

∂ log q(z) / ∂η = (2z − 1) e^{−(2z−1)η} / (1 + e^{−(2z−1)η}).

For the normal variational approximation, we use the standard parameterization with mean μ and variance σ². The distribution is

q(z) = (1 / √(2πσ²)) exp(−(z − μ)² / (2σ²)).

The score functions are

∂ log q(z) / ∂μ = (z − μ) / σ²,
∂ log q(z) / ∂σ² = −1 / (2σ²) + (z − μ)² / (2σ⁴).

Parameterizations of Variational Distributions. Several of our variational parameters, like the variance of the normal, have positivity constraints. To enforce these constraints, we transform an unconstrained variable by log(1 + exp(x)). To avoid numerical issues when sampling, we truncate values when appropriate. Gradients of the unconstrained parameters are obtained with the chain rule from the score function and the derivative of the softmax: exp(x) / (1 + exp(x)).

Optimization. We perform a gradient ascent step on the ELBO using

Δθ = ρ D ∇_θ ELBO,   (1)

where ρ is a fixed scalar set to 0.2 in our experiments, ∇_θ ELBO is a noisy gradient estimated using BBVI, and D is a diagonal preconditioning matrix estimated using the RMSProp heuristic. A diagonal element of D is the reciprocal of the square root of a running average of the squares of historical gradients of that component. We used a window size of 10 in our experiments.
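A minimal sketch of this preconditioned ascent step (the gradient function is a placeholder for the noisy BBVI gradient; the windowed squared-gradient average follows the description above):

```python
import numpy as np

def preconditioned_ascent(grad_fn, theta, rho=0.2, window=10, num_iters=200):
    """Gradient ascent where each coordinate is scaled by 1 / sqrt(running mean of squared gradients)."""
    history = []
    for _ in range(num_iters):
        g = grad_fn(theta)                                   # noisy gradient of the ELBO (placeholder)
        history.append(g ** 2)
        mean_sq = np.mean(history[-window:], axis=0)         # running window of squared gradients
        theta = theta + rho * g / np.sqrt(mean_sq + 1e-8)
    return theta

# Toy usage: ascend the concave objective -0.5 * ||theta||^2, whose gradient is -theta.
print(preconditioned_ascent(lambda t: -t, theta=np.array([3.0, -2.0])))
```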

Hyperparameters and Convergence. We use the same hyperparameters on the gamma distributions at each layer, with shape and rate 0.3. For the sigmoid belief network we use a prior of 0.1 to achieve some sparsity as well. We fix the Poisson prior rate to be 0.1. For gamma W's we use shape 0.1 and rate 0.3. For Gaussian W's we use a prior mean of 0 and variance of 1. We let the experiments run for 10,000 iterations, at which point the validation likelihood is stable.

Algorithm 1 BBVI for DEFs

Input: data X, model p, L layers.
Initialize λ, ξ randomly; t = 1.
repeat
    // Draw a single data point from X
    n = Unif(D)
    // Get S samples in parallel
    for s = 1 to S do
        z_1[s] ~ q(z_1; λ_{n,1})
        W_0[s] ~ q(W_0; ξ_0)
        p_0[s] = log p(x_n | z_1[s], W_0[s])
        q_1[s] = log q(z_1[s]; λ_{n,1})
        g_1[s] = ∇_{λ_{n,1}} log q(z_1[s]; λ_{n,1})
        gW_0[s] = ∇_{ξ_0} log q(W_0[s]; ξ_0)
        pW_0[s] = log p(W_0[s])
        qW_0[s] = log q(W_0[s]; ξ_0)
        for l = 2 to L do
            z_l[s] ~ q(z_l; λ_{n,l})
            W_{l−1}[s] ~ q(W_{l−1}; ξ_{l−1})
            p_{l−1}[s] = log p(z_{l−1}[s] | z_l[s], W_{l−1}[s])
            q_l[s] = log q(z_l[s]; λ_{n,l})
            g_l[s] = ∇_{λ_{n,l}} log q(z_l[s]; λ_{n,l})
            gW_{l−1}[s] = ∇_{ξ_{l−1}} log q(W_{l−1}[s]; ξ_{l−1})
            pW_{l−1}[s] = log p(W_{l−1}[s])
            qW_{l−1}[s] = log q(W_{l−1}[s]; ξ_{l−1})
        end for
        p_L[s] = log p(z_L[s])
    end for
    // Update parameters
    for l = 1 to L do
        for k = 1 to K_l do
            S = g_{l,k} (p_{l−1} + p_{l,k} − q_{l,k})
            λ_{n,l,k} = λ_{n,l,k} + ρ mean(S)
        end for
        T = gW_{l−1} (pW_{l−1} − qW_{l−1} + p_{l−1})
        ξ_{l−1} = ξ_{l−1} + ρ mean(T)
    end for
until the change in validation likelihood is less than ε.

For the double DEF, we set all shapes to 0.1 and rates to 0.3. We let the double DEF experiment run for about 10,000 iterations. The validation likelihood had converged for all models by this point.


[Appendix Figure 2 shows the Science topic hierarchy; only the caption is reproduced here.]

Figure 2: A fraction of the three layer topic hierarchy of the Science corpus. The top words are shown for each "topic." The arrows represent hierarchical groupings. We choose the top three components at each layer. Similar "topics" are grouped into "super topics." The two "concepts" share a "super topic."