Introduction to Generalized Linear Models (GLMs)

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Introduction to Generalized Linear Models (GLMs)

Prof. Nicholas Zabaras Materials Process Design and Control Laboratory

Sibley School of Mechanical and Aerospace Engineering 101 Frank H. T. Rhodes Hall

Cornell University Ithaca, NY 14853-3801

Email: [email protected]

URL: http://mpdc.mae.cornell.edu/

January 10, 2014

1

mailto:[email protected]

http://mpdc.mae.cornell.edu/


Introduction, Canonical Link Function, Examples

MLE Estimation, Computing the Hessian, Bayesian Inference for GLMs

Probit Regression, Personalized Spam Filtering, Domain adaptation, Other Types of Prior

Generalized Linear Mixed Models, Computational Issues, Learning to Rank, Pairwise Approach, Listwise Approach, Loss Functions for Ranking

2

Following closely Chapter 9, of K. Murphy, Machine Learning: A probabilistic Perspective

Contents

http://www.cs.ubc.ca/~murphyk/MLbook/


Introduction to GLMs In GLMs the output density is in the exponential family, and in which

the mean parameters are a linear combination of the inputs, passed through a possibly nonlinear function, such as the logistic function.

Linear and logistic regression are examples of GLMs.

Let us first consider an unconditional distribution for a scalar response variable:

Here σ2 is the dispersion parameter (often set to 1), θ is the natural parameter, A is the partition function, and c is a normalization constant.

In logistic regression, θ is the log-odds ratio, θ = log(μ/(1−μ)), where μ = E [y] = p(y = 1) is the mean parameter.

3

2 22

( )| , exp ,ii i

y Ap y c y

McCullagh, P. and J. Nelder (1989). Generalized linear models. Chapman and Hall. 2nd edition.

http://mpdc.mae.cornell.edu/Courses/MAE714/TheExponentialFamily.pdf

http://www.amazon.com/Generalized-Edition-Monographs-Statistics-Probability/dp/0412317605


mean function

link function

Introduction to GLMs To convert from the mean parameter to the natural parameter we use a function ψ, so θ = Ψ(μ). ψ is uniquely determined by the form of the exponential distribution.

This is an invertible mapping, so we have μ = Ψ−1(θ)

We know that the mean is given by the derivative of the partition function, so

we have μ = Ψ−1(θ) = A’(θ).

We define a linear function of the inputs/covariates: ηi = wT xi

We make the mean of the distribution to be some invertible monotonic function of ηi . This function, known as the mean function, is denoted by g−1.

The inverse of the mean function, namely g(), is called the link function.

In logistic regression, we set μi = g−1(ηi) = sigm(ηi).

4

1 1 Ti i ig g w x


mean function

link function

Canonical Link Function One particularly simple form of the link function is to use g = ψ; this is called the canonical link function. In this case, θi = ηi = wT xi, so the model becomes

For the Bernoulli/ binomial distribution, the canonical link is the logit function, g(μ) = log(η/(1−η)), whose inverse is the logistic function, μ = sigm(η). We have shown that the mean and variance of the response variable are as

follows:

5

2 22

( )| , , exp ,T T

i i ii i i

y Ap y c y

w x w xx w

2 2 2 2| , , ' , var | , , ''i i i i i i i iy A y A x w x w

http://mpdc.mae.cornell.edu/Courses/MAE714/TheExponentialFamily.pdf


Example: Linear Regression For linear regression, we can write the following:

where In this case,

and thus:

6

, Ti i i iy w x

2

22 2 2

2 2 2

( ) 12log | , , , log 22

iT T i i

i i i ii i i

yy A yp y c y

w x w xx w

2

( )2

A

2[ ] var[ ]i i iy and y


Example: Binomial Regression For binomial regression with , we can write the following:

where:

In this case,

thus:

7

2( ), log , 11

T Tii i i i

i

sigm

w x w x

22

( )log | , , log log 1 log1

T Tii i i i

i i i i i iii

Ny Ap y c y y Ny

w x w xx w

0,1,2,...,i iy N

( ) log 1iA N e

[ ] var[ ] 1i i i i i i iy N and y N


Example: Poisson Regression For Poisson regression with , we can write the following:

where: Poisson regression is used often in biostatistics where yi may represent the

number of diseases of a given person or in a given city, etc.

8

2exp( ), log , ( ) , 1,[ ] var[ ]

T Ti i i i i

i i i

A ey y

w x w x

22

( )log | , , log log !T T

i i ii i i i i i i

y Ap y c y y y

w x w xx w

0,1,2,...iy

Kuan, P., G. Pan, J. A. Thomson, R. Stewart, and S. Keles (2009). A hierarchical semi-Markov model for detecting enrichment with application to ChIP-Seq experiments. Technical report, U. Wisconsin.

http://www.stat.wisc.edu/~keles/Papers/TechReport_HSMMchipseq.pdf

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/TechReport_HSMMchipseq.pdf

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/TechReport_HSMMchipseq.pdf


MLE Estimation The log-likelihood has the form:

The gradient is easily computed using the chain rule:

If we use a canonial link:

Thus the log likelihood takes a familiar form that can be used directly in stochastic optimization:

9

21

1log | , ( )N

i i i i ii

l p y A

Dw w

'( )i i i i i i i i ii i ij i i ij

j i i i j i i i i

d d d d d d d d dy A x y xdw d d d dw d d d d

mean function

link function i i

i i ij i i ij

d dy x or ydw d

x

w

21

1log |N

i i ii errors

l p y

Dw w w x


Computing the Hessian For using optimization methods that require the Hessian, we can see that:

Thus the Hessian for canonical link is:

H can be used within an IRLS algorithm where the least squares update is: For non-canonical links, H has another term. However, the expected H

remains is given as above. Using the expected Hessian (Fisher information matrix) instead of the actual H is known as the Fisher scoring method.

Finally note that it is easy to modify the above procedure to perform MAP estimation with a Gaussian prior.

10

mean function

link function

12 2

1 1

1 1 ,...,N

T Ti Ni i

i i N

d dd, diagd d d

H x x X SX S

2

2 2

1 1i i i i ii i ij ij ik ij

j k j i k i

d d d d dy x x x xdw dw dw d dw d

1

1

1 1,

T Tt t t t

t t t t t t t twhere g

w X S X X S z

z S y Xw


Bayesian Inference for GLMs Bayesian inference for GLMs is usually conducted using MCMC (see

forthcoming lecture). Possible methods include

o Metropolis Hastings with an IRLS-based proposal (Gamerman 1997),

o Gibbs sampling using adaptive rejection sampling (ARS) for each full-conditional (Dellaportas and Smith 1993)

o Dey et al. provides a review of methodologies.

It is also possible to use the Gaussian (Laplace) approximation or variational inference.

11

Gamerman, D. (1997). Efficient sampling from the posterior distribution in generalized linear mixed models. Statistics and Computing 7, 57–68.

Dellaportas, P. and A. F. M. Smith (1993). Bayesian Inference for Generalized Linear and Proportional Hazards Models via Gibbs Sampling. J. of the Royal Statistical Society. Series C (Applied Statistics) 42(3), 443–459.

Dey, D., S. Ghosh, and B. Mallick (Eds.) (2000). Generalized Linear Models: A Bayesian Perspective. Chapman & Hall/CRC Biostatistics Series.

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/Gamerman1997.pdf

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/2986324.pdf

http://www.jstor.org/stable/2986324


http://books.google.com/books?id=Y5AARf7oTNkC&pg=PA3&lpg=PA3&dq=Generalized+Linear+Models:+A+Bayesian+Perspective&source=bl&ots=8Z1e9Lnr9y&sig=cSIP43D1o89JkZwT7c-M9rIbdw4&hl=en&sa=X&ei=9vXOUozqGsi3sASVwoCQDA&ved=0CDsQ6AEwAQ#v=onepage&q=Generalized%20Linear%20Models%3A%20A%20Bayesian%20Perspective&f=false

http://www.amazon.com/Generalized-Linear-Models-Perspective-Biostatistics/dp/0824790340


Probit Regression In binary logistic regression, we use a model of the form p(y = 1|xi,w)

= sigm(wT xi).

We can generalize to p(y = 1|xi,w) = g−1(wT xi), for any function g−1

that maps [−∞,∞] to [0, 1]. Possible mean functions are shown here.

We focus now on the case where g−1(η) = Φ(η), where Φ(η) is the cdf of the standard normal.

This is called probit regression.

The probit function even though similar to the logistic function, it has several advantages over logistic regression.

12

probitPlot from Kevin Murphys’ PMTK

https://code.google.com/p/pmtk3/




Probit Regression We can find the MLE for probit regression using standard gradient

methods. Let Then the gradient of the log-likelihod for a specific case is given by

Here φ is the standard normal pdf, and Φ is its cdf.

Similarly, the Hessian for a single case is given by

We can also compute the MAP estimate. If we use the prior p(w) = N(0,V0), the gradient and Hessian of the penalized log likelihood have the form

These expressions can be used with any gradient-based optimizer.

13

, 1,1Ti i iy w x

iiT Tii i i ii i

i ii

ydd dlogp y logp yd d d y

g | w x | w x xw w

2

22i i iiT T

i i i iiiiii

yd logp yd yy

H | w x x xw

1 10 02 2i i

i i

g V w, H VprobitRegDemo from Kevin Murphys’ PMTK





Interpretation with Latent Variables We interpret the probit model as follows: We associate each xi with

two latent utilities, u0i and u1i, corresponding to the possible choices yi = 0 & yi = 1. The observed choice is whichever action has larger utility.

The model is as follows:

δ’s are errors, representing all other factors thatwe have chosen not to (unable to) model.

This is called a random utility model or RUM.

Since it is only the difference in utilities that matters, let us define zi = u1i −u0i , where εi = δ1i − δ0i. If the δ’s have a Gaussian distribution, then so does εi.

14

0 1 1 1 1, ,T Toi i oi i i i i i oiu u y u u w x w x

McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka (Ed.), Frontiers in econometrics, pp. 105–142. Academic Press.

Train, K. (2009). Discrete choice methods with simulation. Cambridge University Press. Second edition

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/zarembka.pdf

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/train1201.pdf

http://elsa.berkeley.edu/books/train1201.pdf


Interpretation with Latent Variables Thus we can write

This is called the difference RUM or dRUM model.

When we marginalize out zi, we can easily see that we recover the probit model:

15

1 0, , (0,1), 1 0Ti i i i i iz y zw x w w w N

, 0 ( | ,1) 0

1

T T Ti i i i i i i i

T Ti i

p y z z dz p p| x w w x w x w x

w x w x

N

Fruhwirth-Schnatter, S. and R. Fruhwirth (2010). Data Augmentation and MCMC for Binary and Multinomial Logit Models. In T. Kneib and G. Tutz (Eds.), Statistical Modelling and Regression Structures, pp. 111–132. Springer.

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/chp:10.1007/978-3-7908-2413-1_7.pdf



Orbital Probit Regression An advantage of the latent variable interpretation of probit regression is that

it is easy to extend when the response variable is ordinal, i.e. when it can take on C discrete ordered values, such as low, medium and high. This is called ordinal regression.

The basic idea is as follows. We introduce C + 1 thresholds γj and set yi = j if γj−1 < zi ≤ γj , where γ0 ≤ · · · ≤ γC.

For identifiability reasons, we set γ0 = −∞, γ1 = 0 and γC = ∞. For example, if C = 2, this reduces to the standard binary probit model, whereby zi < 0 produces yi = 0 and zi ≥ 0 produces yi = 1.

If C = 3, we partition the real line into 3 intervals: (−∞, 0], (0, γ2], (γ2,∞). We can vary γ2 to ensure the right relative amount of probability mass falls in each interval, so as to match the empirical frequencies of each class label.

Finding the MLEs for this model is a bit trickier than for binary probit regression, since we need to optimize for w and γ, and the latter must obey an ordering constraint. EM and Gibbs sampling have been used.

16

Kawakatsu, H. and A. Largey (2009). EM algorithms for ordered probit models with endogenous regressors. The Econometrics Journal 12(1), 164–186.

Hoff, P. D. (2009, July). A First Course in Bayesian Statistical Methods. Springer.

http://onlinelibrary.wiley.com/doi/10.1111/j.1368-423X.2008.00272.x/pdf

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/j.1368-423X.2008.00272.x.pdf

http://link.springer.com/book/10.1007/978-0-387-92407-6


Multinomial Probit Regression Consider a response variable takes C unordered categorical values, yi ∈ {1, .

. . , C}. The multinomial probit model is defined as follows:

This model is related to multinomial logistic regression. (By defining w = [w1, . . . ,wC], and xic = [0, . . . , 0, xi, 0, . . . , 0], we can recover the more familiar formulation

Since only relative utilities matter, we constrain R to be a correlation matrix. If instead of setting yi = argmaxc zic we use yic = (zic > 0), we get a model known as multivariate probit, which is a way to model C correlated binary outcomes.

17

, ~ ( , ), arg maxTic ic ic i ic

cz y z 0N= w x R

Tic i cz = x w

Fruhwirth-Schnatter, S. and R. Fruhwirth (2010). Data Augmentation and MCMC for Binary and Multinomial Logit Models. In T. Kneib and G. Tutz (Eds.), Statistical Modelling and Regression Structures, pp. 111–132. Springer.

Dow, J. and J. Endersby (2004). Multinomial probit and multinomial logit: a comparison of choice models for voting research. Electoral Studies 23(1), 107–122

Talhouk, A., K. Murphy, and A. Doucet (2011). Efficient Bayesian Inference for Multivariate Probit Models with Sparse Inverse Correlation Matrices. J. Comp. Graph. Statist..



http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/1-s2.0-S0261379403000404-main.pdf

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/1-s2.0-S0261379403000404-main.pdf


Multitask Learning Often we want to fit many related classification or regression models. It is

reasonable to assume the input-output mapping is similar across these different models, so we can get better performance by fitting all the parameters at the same time.

This is called multi-task learning and is tackled using hierarchical Bayesian models.

Let yij be the response of the i’th item in group j, for i = 1 : Nj and j = 1 : J. For example, j might index schools, i might index students within a school, and yij might be the test score.

Or j might index people, and i might index purchases, and yij might be the identity of the item that was purchased (this is known as discrete choice modeling).

18

Caruana, R. (1998). A dozen tricks with multitask learning. In G. Orr and K.-R. Mueller (Eds.), Neural Networks: Tricks of the Trade. Springer-Verlag.

Raina, R., A. Ng, and D. Koller (2005). Transfer learning by constructing informative priors. In NIPS. Bakker, B. and T. Heskes (2003). Task Clustering and Gating for Bayesian Multitask Learning. J. of Machine

Learning Research 4, 83–99. Chai, K. M. A. (2010). Multi-task learning with Gaussian processes. Ph.D. thesis, U. Edinburgh.

http://reports-archive.adm.cs.cmu.edu/anon/1997/CMU-CS-97-203.pdf

http://link.springer.com/chapter/10.1007/3-540-49430-8_9#page-1

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/icml06-transferinformativepriors.pdf

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/BakkerH03.pdf

https://www.era.lib.ed.ac.uk/bitstream/1842/3847/1/Chai2010.pdf

http://mpdc.mae.cornell.edu/Courses/MAE714/Chai2010.pdf


Multitask Learning Let xij be a feature vector associated with yij . The goal is to fit the models

p(yj |xj) for all j. Some groups have lots of data but the majority of groups have little data. We

can use the same model for all groups or fit a separate model for each group, but encourage the model parameters to be similar across groups.

More precisely, suppose where g is a link function. Also let

In this model, groups with small sample size borrow statistical strength from the groups with larger sample size, because the βj ’s are correlated via the latent common parents β∗.

The term controls how much group j depends on the common parents and the term controls the strength of the overall prior.

Suppose, for simplicity, that μ = 0, and that are all known (e.g., they could be set by cross validation).

19

Tij ij ij jy g | x x

2 2* * *~ , , ~ ,j j N N I I

2 2*,j

2*

2j


Multitask Learning The log probability has the form:

We can perform MAP estimation of using gradient methods.

Alternatively, we can perform an iterative optimization scheme alternating between optimizing the and the . Since the lilelihood and the prior are convex this will converge to the global optimum.

Once the model is trained, we can discard and use each model separately.

20

2 2

* *2 2

*

log ( | ) log ( ) log |2 2

jj j

j j

p p p

D D

1: *,J

j*

*


Multitask Learning: Personalized Spam Filtering We want to fit one classifier per user, βj . Since most users do not label their

email as spam or not, it is hard to estimate these models independently. Let βj have a common prior β∗, representing the parameters of a generic user.

In this case, we can emulate the behavior of the above model with a simple trick: we make two copies of each feature xi, one concatenated with the user id, and one not. The effect will be to learn a predictor of the form (u=user ID):

Thus β∗ will be estimated from everyone’s email, whereas wj will just be estimated from the user j’s email.

To see correspondence with the hierarchical Bayesian model, let wj = βj−β∗.

Then the log probability of the original model can be rewritten a

21

Daume, H. (2007b). Frustratingly easy domain adaptation. In Proc. the Assoc. for Comp. Ling. Attenberg, J., K. Weinberger, A. Smola, A. Dasgupta, and M. Zinkevich (2009). Collaborative spam filtering with

the hashing trick. In Virus Bulletin. Weinberger, K., A. Dasgupta, J. Attenberg, J. Langford, and A. Smola (2009). Feature hashing for large scale

multitask learning. In Intl. Conf. on Machine Learning.

[ ] ( ) ( )* 1| , , ,..., , ( 1) ,...Ti i J i i iy u u u Jx w w x x x = = = β

[ ] ( )*| ,T

i i j iy u j= = + x w xβ

( )2 2

** 2 2

*

log |2 2

jj j

j j

p Dσ σ

− −

∑w

+ wβ

β

http://arxiv.org/pdf/0907.1815v1.pdf

http://mpdc.mae.cornell.edu/Courses/MAE714/ceas2009-paper-11.pdf

http://mpdc.mae.cornell.edu/Courses/MAE714/ceas2009-paper-11.pdf

http://arxiv.org/pdf/0902.2206.pdf

http://arxiv.org/pdf/0902.2206.pdf


Multitask Learning: Personalized Spam Filtering

If we assume , the effect is the same as using the augmented feature trick, with the same regularizer strength for both wj and β∗.

However, one typically gets better performance by not requiring that

22

Finkel, J. and C. Manning (2009). Hierarchical bayesian domain adaptation. In Proc. NAACL, pp. 602–610

( )2 2

** 2 2

*

log |2 2

jj j

j j

p Dσ σ

− −

∑w

+ wβ

β

2 2*jσ σ=

2 2*jσ σ=

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/N09-1068.pdf


Multitask Learning: Domain Adaptation This is the problem of training a set of classifiers on data drawn from

different distributions (e.g. email & news articles text).

This problem is a special case of multi-task learning, where the tasks are the same.

The above hierarchical Bayesian model was used to perform domain adaptation for two Natural Language Processing (NLP) tasks, namely named entity recognition and parsing.

Large improvements over fitting separate models to each dataset, but small improvements when pooling all the data to fit a single model.

23

Finkel, J. and C. Manning (2009). Hierarchical bayesian domain adaptation. In Proc. NAACL, pp. 602–610

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/N09-1068.pdf


Multitask Learning: Other Types of Prior Other than Gaussian priors are often more suitable.

Consider conjoint analysis: figuring out which features of a product customers like best.

This can be modelled with the hierarchical Bayesian setup, but using a sparsity-promoting prior on βj, rather than a Gaussian prior.

This is called multi-task feature selection.

Often we cannot assume that all tasks are all equally similar. If we pool the parameters across tasks that are qualitatively different, the performance will be worse than not using pooling, because the inductive bias of our prior is wrong (negative transfer).

24

Lenk, P., W. S. DeSarbo, P. Green, and M. Young (1996). Hierarchical Bayes Conjoint Analysis: Recovery of Partworth Heterogeneity from Reduced Experimental Designs. Marketing Science 15(2), 173– 191.

Argyriou, A., T. Evgeniou, and M. Pontil (2008). Convex multi-task feature learning. Machine Learning 73(3), 243–272.

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/HBConjointLenkDeSarboGreenYoungMS1996.pdf

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/HBConjointLenkDeSarboGreenYoungMS1996.pdf

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/mtl_feat.pdf


Multitask Learning: Other Types of Prior One way around this problem is to use as prior a mixture of Gaussians. This

provides robustness against prior mis-specification.

One also can combine Gaussian mixtures with sparsity-promoting priors.

25

Ji, S., D. Dunson, and L. Carin (2009). Multi-task compressive sensing. IEEE Trans. Signal Processing 57 (1) Xu, F. and J. Tenenbaum (2007). Word learning as Bayesian inference. Psychological Review 114(2). Jacob, L., F. Bach, and J.-P. Vert (2008). Clustered Multi-Task Learning: a Convex Formulation. In NIPS.

http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=04627439


http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/XuTenenbaum-PsychRev.pdf

http://arxiv.org/pdf/0809.2085v1.pdf


Generalized Linear Mixed Models We now generalize multi-task learning to allow the response to include

information at the group level, xj , as well as at the item level, xij . Similarly, we can allow the parameters to vary across groups, βj , or to be tied across groups, α. This leads to the following model:

A model with both fixed and random effects is called a mixed model. If p(y|x) is a GLM, the overall model is called a generalized linear mixed effects model or GLMM.

26

( ) ( ) ( ) ( )( )'1 2 3 4| , '

T T T T

ij ij j ij j j j ij jy gx x x x x x φ φ φ φ = + + + β β α α

Here φk are basis functions. Note that the number of βj parameters grows with the number of groups, whereas the size of α is fixed.

The terms βj are called random effects (they vary randomly across groups), and α are called the fixed effect (fixed but unknown constant).

Wand, M. (2009). Semiparametric regression and graphical models. Aust. N. Z. J. Stat. 51(1), 9–41.

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/Wand09.pdf


Generalized Linear Mixed Models Consider the following example. Suppose yij is the amount of spinal bone

mineral density (SBMD) for person j at measurement i. Let xij be the age of person, and let xj be their ethnicity: White, Asian, Black, or Hispanic.

The primary goal is to determine if there are significant differences in the mean SBMD among the four ethnic groups, after accounting for age. There is a nonlinear effect of SBMD vs age, so we use a semi-parametric model which combines linear regression with non-parametric regression. We also see that there is variation across individuals within each group, so

we will use a mixed effects model. Specifically, we will use φ1(xij) = 1 to account for the random effect of each person; φ2(xij) = 0 since no other coefficients are person-specific; φ3(xij) = [bk(xij )], where bk is the k’th spline basis functions, to account for the nonlinear effect of age; and

to account for the effect of the different ethnicities.

27

( ) ( ) ( ) ( ) ( )( )4 , , , ,j j j j jx x w x a x b x h φ = = = = =

Wand, M. (2009). Semiparametric regression and graphical models. Aust. N. Z. J. Stat. 51(1), 9–41. Ruppert, D., M. Wand, and R. Carroll (2003). Semiparametric Regression. Cambridge University Press


http://www.amazon.com/Semiparametric-Regression-Statistical-Probabilistic-Mathematics/dp/0521785162


Generalized Linear Mixed Models The overall model is therefore

α contains the non-parametric part of the model related to age, α’ contains

the parametric part of the model related to ethnicity, and βj is a random offset for person j.

We endow all of these regression coefficients with separate Gaussian priors.

One can perform posterior inference to compute p(α,α’, β,σ2|D)

After fitting, we can compute the prediction for each group.

We can also perform significance testing, by computing p(αg−αw|D) for each ethnic group g relative to some baseline (e.g. white).

28

( )( )

( ) ( ) ( ) ( )2

' ' ' '

~ 0,

|

y

Tij ij j j ij ij w j a j b j h jy x x x x w x a x b x h

N

b

σ

β ε α α α α = + + + = + = + = + = α,

Wand, M. (2009). Semiparametric regression and graphical models. Aust. N. Z. J. Stat. 51(1), 9–41. Ruppert, D., M. Wand, and R. Carroll (2003). Semiparametric Regression. Cambridge University Press


http://www.amazon.com/Semiparametric-Regression-Statistical-Probabilistic-Mathematics/dp/0521785162


Computational Issues GLMMs are difficult to fit. At first, p(yij |θ) may not be conjugate to the prior

p(θ) where θ = (α, β).

Secondly, the regression coefficients θ and the means and variances of the priors η = (μ,σ) are unknown.

One approach is to use fully Bayesian inference methods, such as variational Bayes or MCMC.

An alternative approach is to use empirical Bayes.

In the context of a GLMM, we can use the EM algorithm, where in the E step we compute p(θ|η,D), and in the M step we optimize η.

In general for the E step we need to use approximations (numerical quadrature or Monte Carlo).

29

Hall, P., J. T. Ormerod, and M. P. Wand (2011). Theory of Gaussian Variational Approximation for a Generalised Linear Mixed Model. Statistica Sinica 21, 269–389.

Gelman, A. and J. Hill (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge. Breslow, N. E. and D. G. Clayton (1993). Approximate inference in generalized linear mixed models. J. of the

Am. Stat. Assoc. 88(421), 9–25.

http://matt-wand.utsacademics.info/papers.html

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/Hall11.pdf

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/Hall11.pdf

http://www.amazon.com/Analysis-Regression-Multilevel-Hierarchical-Models/dp/052168689X




Computational Issues A faster approach is to use variational EM.

In (frequentist) statistics, there is a popular method for fitting GLMMs called

generalized estimating equations or GEE.

This approach is not recommended as it is not as statistically efficient as likelihood-based methods.

In addition, it can only provide estimates of the population parameters α, but not the random effects βj.

30

Braun, M. and J. McAuliffe (2010). Variational Inference for Large-Scale Models of Discrete Choice. J. of the Am. Stat. Assoc. 105(489), 324–335.

Hardin, J. and J. Hilbe (2003). Generalized Estimating Equations. Chapman and Hall/CRC.

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/jasaE2009Etm08030.pdf

http://www.amazon.com/Generalized-Estimating-Equations-Second-Edition/dp/1439881138


Learning to Rank (LETOR) We want to learn a function that can rank order a set of items (e.g. in

information retrieval).

Specifically, suppose we have a query q and a set of documents d1, . . . , dm that might be relevant to q (e.g., all documents that contain the string q).

We would like to sort these documents in decreasing order of relevance and show the top k to the user. Similar problems arise in collaborative filtering.

A standard way to measure the relevance of a document d to a query q is to use a probabilistic language model based on a bag of words ' model. That is, we define where qi is the i’th word or term, and p(qi|d) is a multinoulli distribution estimated from document d.

In practice, we need to smooth the estimated distribution, e.g. using a Dirichlet prior, representing the overall frequency of each word. This can be estimated from all document in the system:

Here TF(t, d) is the frequency of term t in document d, LEN(d) is the number of words in d, and 0 < λ < 1 is a smoothing parameter.

31 Liu, T.-Y. (2009). Learning to rank for information retrieval. Found. & Trends in Inform. Retrieval 3(3), 225–331.

( , )( | ) (1 ) ( | )( )

TF t dp t d p t backgroundLEN d

λ λ= − +

1

( , ) ( | ) ( | )n

ii

sim q d p q d p q d=

= = ∏



Learning to Rank

Here TF(t, d) is the frequency of term t in document d, LEN(d) is the number of words in d, and 0 < λ < 1 is a smoothing parameter.

There are many other signals that we can use to measure relevance.

For example, the PageRank of a web document is a measure of its authoritativeness, derived from the web’s link structure.

We can also compute how often and where the query occurs in the document.

32

( , )( | ) (1 ) ( | )( )

TF t dp t d p t backgroundLEN d

λ λ= − +

Liu, T.-Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval 3(3), 225–331.

Zhai, C. and J. Lafferty (2004). A study of smoothing methods for language models applied to information retrieval. ACM Trans. on Information Systems 22(2), 179–214.

http://anand.typepad.com/datawocky/2008/05/are-human-experts-less-prone-to-catastrophic-errors-than-machine-learned-models.html


http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/p179-zhai.pdf

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/p179-zhai.pdf


Learning to Rank: Pointwise Approach Suppose we collect some training data representing the relevance of a set of

documents for each query.

Specifically, for each query q, suppose that we retrieve m possibly relevant documents dj, for j = 1 : m.

For each query document pair define a feature vector, x(q, d). This contains e.g. the query-document similarity score & page rank score of the document

Suppose we have a set of labels yj representing the degree of relevance of document dj to query q. Such labels might be binary (relevant/irrelevant), or may represent a degree of relevance (very relevant/somewhat relevant/ irrelevant).

Such labels can be obtained from query logs, by thresholding the number of times a document was clicked on for a given query.

If we have binary relevance labels, we can solve the problem using a standard binary classification scheme to estimate, p(y = 1|x(q, d)).

33


Learning to Rank: Pointwise Approach If we have ordered relevancy labels, we can use ordinal regression to predict

the rating, p(y = r|x(q, d)).

In both binary and ordered relevancy label cases, we can then sort the documents by this scoring metric. This is called the pointwise approach to Learning to Rank (LETOR), and is widely used because of its simplicity.

This method does not take into account the location of each document in the list. Thus it penalizes errors at the end of the list just as much as errors at the beginning. In addition, each decision about relevance is made very myopically.

34


Learning to Rank: Pairwise Approach People are better at judging the relative relevance of two items rather than

absolute relevance.

Consequently, the data might tell us that dj is more relevant than dk for a given query, or vice versa.

We can model this kind of data using a binary classifier of the form

Here we set yjk = 1 if rel(dj, q) > rel(dk, q) and yjk = 0 otherwise.

One way to model such a function is as follows:

f(x) is a scoring function, e.g. f(x) = wT x.

35

( ) ( ) ( )( )1, ,jk j k j kp y sigm f f= = −x x x x

( )| ( , ), ( , )jk j kp y q d q dx x

Carterette, B., P. Bennett, D. Chickering, and S. Dumais (2008). Here or There: Preference Judgments for Relevance. In Proc. ECIR

http://research.microsoft.com/en-us/um/people/pauben/papers/HereOrThere-ECIR-2008.pdf



Learning to Rank: Pairwise Approach

This is a special kind of neural network known as RankNet.

We can find the MLE of w by maximizing the log likelihood, or equivalently, by minimizing the cross entropy loss, given by

This can be optimized using gradient descent.

A variant of RankNet is used by Microsoft’s Bing search engine.

36

( ) ( ) ( )( )1, ,jk j k j kp y sigm f f= = −x x x x

( ) ( )( ) ( )

1 1 1

1 log 1| , ,

0 log 0 | , ,

i im mN

ijki j k j

ijk ijk ijk ij ik

ijk ijk ij ik

L L

L y p y

y p y

= = = +

=

− = = =

+ = =

∑∑ ∑

x x w

x x w

Carterette, B., P. Bennett, D. Chickering, and S. Dumais (2008). Here or There: Preference Judgments for Relevance. In Proc. ECIR

Burges, C. J., T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender (2005). Learning to rank using gradient descent. In Intl. Conf. on Machine Learning, pp. 89–96.

http://www.bing.com/blogs/site_blogs/b/search/archive/2009/06/01/user-needs-features-and-the-science-behind-bing.aspx





http://research.microsoft.com/en-us/um/people/cburges/papers/icml_ranking.pdf

http://research.microsoft.com/en-us/um/people/cburges/papers/icml_ranking.pdf


Learning to Rank: The Listwise Approach In pairwise approach decisions about relevance are made just based on a

pair of items (documents), rather than considering the full context.

Consider methods that look at the entire list of items at the same time.

We can define a total order on a list by specifying a permutation of its indices, π. To model our uncertainty about π, we can use the Plackett-Luce distribution. This has the following form:

sj = s(π-1(j)) is the score of the document ranked at the j’th position.

As an example, suppose π = (A,B,C). Then we have that p(π) is the probability of A being ranked first, times the probability of B being ranked second given that A is ranked first, times the probability of C being ranked third given that A and B are ranked first and second. In other words,

37

( )1

|m

jm

ju

u j

sp s

sπ

=

=

= ∏∑

( )| A B B

A B C B C C

s s sp ss s s s s s

π = × ×+ + +

Plackett, R. (1975). The analysis of permutations. Applied Stat. 24, 193–202. Luce, R. (1959). Individual choice behavior: A theoretical analysis. Wiley


http://www.amazon.com/Individual-Choice-Behavior-Theoretical-Mathematics/dp/0486441369


Learning to Rank: The Listwise Approach To incorporate features, we can define s(d) = f(x(q, d)), where we often take

f to be a linear function, f(x) = wT x. This is known as the ListNet model.

To train this model, let yi be the relevance scores of the documents for query i. We then minimize the cross entropy term

This is intractable, since the i’th term needs to sum over mi! permutations. To make it tractable, we consider permutations over the top k positions only:

There are only m!/(m − k)! such permutations. If we set k = 1, we can evaluate each cross entropy term (and its derivative) in O(m) time.

In the special case where only one document from the presented list is deemed relevant, say yi = c, we can use multinomial logistic regression:

This performs as well as ranking methods for collaborative filtering.

38

( | ) log ( | )i ii

p y p p sπ

π−∑∑

( )1: 1:1

|k

jk m m

ju

u j

sp s

sπ

=

=

= ∏∑

Cao, Z., T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li (2007). Learning to rank: From pairwise approach to listwise approach. In Intl. Conf. on Machine Learning, pp. 129a ˘A，S136.

Yang, S., B. Long, A. Smola, H. Zha, and Z. Zheng (2011). Collaborative competitive filtering: learning recommender using context of user choice. In Proc. Annual Intl. ACM SIGIR Conference.

( ) ( )( )'

'

exp|

expc

ic

c

sp y c

s= =

∑x

http://research.microsoft.com/pubs/70428/tr-2007-40.pdf

http://research.microsoft.com/pubs/70428/tr-2007-40.pdf

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/SIGIR11CCF.pdf

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/SIGIR11CCF.pdf


Loss Functions for Ranking There are various ways to measure the performance of a ranking system.

Mean reciprocal rank (MRR). For a query q, let the rank position of its first

relevant document be denoted by r(q). Then we define the mean reciprocal rank to be 1/r(q). This is a very simple performance measure.

Mean average precision (MAP). In the case of binary relevance labels, we can define the precision at k of some ordering as follows:

We then define the average precision as follows:

Here Ik is 1 iff document k is relevant. For example, if we have the relevancy labels y = (1, 0, 1, 0, 1), then the AP is 1/3 ( 1/1 + 2/3 + 3/5 ) ≈ 0.76.

Finally, we define the mean average precision as the AP averaged over all queries.

39

( )@ num. relevant documents in the top k positions ofp kk

ππ =

( )@ ( )

.

kk

P k IAP

num relevant documents

ππ

⋅=

∑


Loss Functions for Ranking Normalized discounted cumulative gain (NDCG). Suppose the relevance

labels have multiple levels. We can define the discounted cumulative gain of the first k items in an ordering as follows:

where ri is the relevance of item i and the log2 term is used to discount items later in the list.

An alternative definition, that places stronger emphasis on retrieving relevant documents, uses

The trouble with DCG is that it varies in magnitude just because the length of a returned list may vary. It is common to normalize this measure by the ideal DCG, i.e. one obtained by optimal ordering: IDCG@k(r)=argmaxπDCG@k(r)

This can be easily computed by sorting r1:m and then computing DCG@k. Finally, we define the normalized discounted cumulative gain or NDCG as DCG/IDCG. The NDCG can be averaged over queries to give a measure of performance. 40

( ) 12 2

@log

ki

i

rDCG k r ri=

= + ∑

( ) ( )1 2

2 1@log 1

irk

iDCG k r

i=

−=

+∑


Loss Functions for Ranking Rank correlation. We can measure the correlation between the ranked list,

π, and the relevance judegment, π∗, using a variety of methods. One approach, known as the (weighted) Kendall’s τ statistics, is defined in terms of the weighted pairwise inconsistency between the two lists:

These loss functions can be used in different ways. In the Bayesian approach, we first fit the model using posterior inference; this depends on the likelihood and prior, but not the loss. We then choose our actions at test time to minimize the expected future loss.

One way to do this is to sample parameters from the posterior, θs ∼ p(θ|D), and then evaluate, say, the precision@k for different thresholds, averaging over θs

41

( )( ) ( )* *1 sgn sgn

, *2

uv u v u vu v

uvu v

w

w

π π π πτ π π <

<

+ − − =

∑∑

Zhang, X., T. Graepel, and R. Herbrich (2010). Bayesian Online Learning for Multi-label and Multi-variate Performance Measures. In AI/Statistics..

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/zhang10b.pdf

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/zhang10b.pdf


Loss Functions for Ranking In the frequentist approach, we try to minimize the empirical loss on the

training set. The problem is that these loss functions are not differentiable functions of the model parameters.

We can either use gradient-free optimization methods, or we can minimize a surrogate loss function instead. Cross entropy loss (i.e., negative log likelihood) is an example of a widely used surrogate loss function.

Another loss, known as weighted approximate-rank pairwise or WARP loss provides a better approximation to the precision@k loss. WARP is defined as follows:

42

( )( ) ( )( )( )( )( ) ( )

'

,: , ,: ,

,: , ( , ') ( , ')y y

WARP f y L rank f y

rank f y f y f y≠

=

= ≥∑

x x

x x x

( ) 1 21

, ... 0k

jj

L k α α α=

= ≥ ≥ ≥∑

Usunier, N., D. Buffoni, and P. Gallinari (2009). Ranking with ordered weighted pairwise classification.

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/icml09-ranking-ordered-weighted-average.pdf


Loss Functions for Ranking Here f (x, :) = [f(x, 1), . . . , f(x, |y|)] is the vector of scores for each possible

output label, or, in IR terms, for each possible document corresponding to input query x.

The expression rank(f (x, :), y) measures the rank of the true label y assigned by this scoring function.

Finally, L transforms the integer rank into a real-valued penalty. Using α1 = 1 and αj>1 = 0 would optimize the proportion of top-ranked correct labels.

Setting α1:k to be non-zero values would optimize the top k in the ranked list, which will induce good performance as measured by MAP or precision@k.

As it stands, WARP loss is still hard to optimize, but it can be further approximated by Monte Carlo sampling, and then optimized by gradient descent.

43

Weston, J., S. Bengio, and N. Usunier (2010). Large Scale Image Annotation: Learning to Rank with Joint Word-Image Embeddings. In Proc. European Conf. on Machine Learning

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/artA10.10072Fs10994-010-5198-3.pdf

http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/artA10.10072Fs10994-010-5198-3.pdf

Introduction to Generalized Linear Models (GLMs)

Documents

Transcript of Introduction to Generalized Linear Models (GLMs)