
Page 1:

Markov-Chain Monte Carlo methods Linear regression

Wanna fit a straight line to your data?

(from old-fashioned least-squares to Markov Chain Monte-Carlo methods)

Javier Gorgas

Multidisciplinary seminar. Dpto Física de la Tierra y Astrofísica, 22/11/18

Outline: Ordinary Least Squares · Problems with error bars · MCMC · Comparison of methods

Page 2:

"Old fashioned" least squares

Ordinary linear regression using least squares (OLS), equivalent to maximum-likelihood estimation under the hypotheses below.

Do not forget the hypotheses:
- Independence among the y values
- Linearity: there is an actual linear relation between x and y
- The dispersion is only due to errors in the dependent variable y (the x values have no errors)
- Gaussianity of the y errors: for a fixed x, the errors in y follow a normal distribution N(0, σ)
- Homoscedasticity: constant errors (fixed σ)
Plus: to make inference about the correlation, the x values must follow a Gaussian distribution.

[Figure: the OLS(Y|X) and OLS(X|Y) fits to the same data set.]

Remember: if you make a Y|X OLS fit you are assuming that X is an independent (error-free) variable and Y is a dependent variable. For data with two observed variables, each with its own uncertainties (a symmetric case), Y|X OLS underestimates the slope of the actual relation!
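A quick way to see this attenuation is a small simulation (an illustrative sketch, not from the talk; all numbers are made up):

  # Sketch: slope attenuation when x also carries measurement errors
  set.seed(1)
  x    <- rnorm(1000, 0, 2)        # true x values
  y    <- x                        # true relation: slope = 1
  obsx <- x + rnorm(1000, 0, 1)    # observed x, with unit measurement error
  obsy <- y + rnorm(1000, 0, 1)    # observed y, with unit measurement error
  coef(lm(obsy ~ obsx))["obsx"]    # close to 0.8: Y|X OLS underestimates the true slope of 1

The expected attenuation factor here is Var(x)/(Var(x) + 1) = 0.8.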


Page 3:

The scientific literature is plagued with this problem.

An example of our own: Sánchez-Blázquez et al. (2006), the presentation of the MILES library.


Page 4:

How to solve this? Easy: symmetrical linear regression

See "Linear Regression in Astronomy", Isobe et al. (1990, ApJ, 364, 104), and Feigelson & Babu (1992, ApJ, 397, 55):


Asymmetrical treatment of X and Y: OLS(Y|X), OLS(X|Y)

Symmetrical treatment of X and Y:
- Bisector OLS: the line that bisects the OLS(Y|X) and OLS(X|Y) fits
- Orthogonal regression (OR): minimizes the perpendicular distances to the line
- Reduced major axis (RMA): minimizes the sum of the distances in X and Y
- Mean OLS: arithmetic mean of OLS(Y|X) and OLS(X|Y)
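For reference, all of these slopes can be computed from the sample variances and covariance; a minimal R sketch (my own illustration of the definitions above, with the OLS(X|Y) slope expressed in the y-versus-x plane):

  sym_slopes <- function(x, y) {
    Sxx <- var(x); Syy <- var(y); Sxy <- cov(x, y)
    b1  <- Sxy / Sxx                                   # OLS(Y|X) slope
    b2  <- Syy / Sxy                                   # OLS(X|Y) slope, in the y-vs-x plane
    c(bisector = tan((atan(b1) + atan(b2)) / 2),       # line bisecting the two OLS fits
      ortho    = (Syy - Sxx + sqrt((Syy - Sxx)^2 + 4 * Sxy^2)) / (2 * Sxy),  # orthogonal regression
      rma      = sign(Sxy) * sqrt(Syy / Sxx),          # reduced major axis
      mean_ols = (b1 + b2) / 2)                        # arithmetic mean of both OLS slopes
  }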

[Figure: log(L/L_sol) versus log(σ) (km/s), comparing the OLS(Y|X), OLS(X|Y), bisector OLS, orthogonal regression, RMA and mean OLS fits.]

Different methods provide different solutions, and you have to choose the most appropriate one. Among the four symmetrical methods, the bisector OLS is usually the best choice (the others depend on the particular units).


Page 5:

Error bars in the Y values: weighted least squares

Minimize:

$$M \;=\; \sum_{i=1}^{n} \frac{d_i^2}{\sigma_i^2} \;=\; \sum_{i=1}^{n} \frac{1}{\sigma_i^2}\,(y_i^{*} - y_i)^2 \;=\; \sum_{i=1}^{n} w_i\,(a + b\,x_i - y_i)^2 , \qquad w_i = \frac{1}{\sigma_i^2}$$

However, the classic formulae (see e.g. Bevington 2002, Data Reduction and Error Analysis for the Physical Sciences) are only valid when the error bars fully explain the observed dispersion.

$$\sigma^2_{\rm typ} \;=\; \frac{\sum 1/\sigma_i^2}{\sum 1/\sigma_i^4} \;\approx\; S_r^2 , \qquad S_r^2 \;=\; \frac{n_0}{n_0 - 2}\,\frac{\sum (y_i - y_i^{*})^2/\sigma_i^2}{\sum 1/\sigma_i^2} , \qquad n_0 \;=\; \frac{\left(\sum 1/\sigma_i^2\right)^2}{\sum 1/\sigma_i^4} \;\;\text{(effective number of points)}$$

Note that, apart from the measurement errors given by the individual error bars, the usual situation in the experimental sciences is to have actual, real dispersion.

Alternative, and only approximate, formulae to derive the errors in the derived coefficients in this case:

$$\sigma_b^2 \;=\; S_r^2\,\frac{\sum 1/\sigma_i^4}{\sum 1/\sigma_i^2}\,\frac{1}{\Delta}\sum \frac{1}{\sigma_i^2} \;=\; S_r^2\,\frac{\sum 1/\sigma_i^4}{\Delta} , \qquad \sigma_a^2 \;=\; S_r^2\,\frac{\sum 1/\sigma_i^4}{\sum 1/\sigma_i^2}\,\frac{1}{\Delta}\sum \frac{x_i^2}{\sigma_i^2}$$

with $\Delta = \sum 1/\sigma_i^2 \, \sum x_i^2/\sigma_i^2 - \left(\sum x_i/\sigma_i^2\right)^2$ the usual weighted least-squares determinant.
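As an illustration, a small R function (my own sketch, not code from the talk) implementing the weighted fit together with the approximate corrected errors above:

  wls_corrected <- function(x, y, sigma) {
    w     <- 1 / sigma^2
    Sw    <- sum(w);     Swx  <- sum(w * x);     Swxx <- sum(w * x^2)
    Swy   <- sum(w * y); Swxy <- sum(w * x * y)
    Delta <- Sw * Swxx - Swx^2                      # weighted least-squares determinant
    a     <- (Swxx * Swy - Swx * Swxy) / Delta      # intercept
    b     <- (Sw * Swxy - Swx * Swy) / Delta        # slope
    n0    <- Sw^2 / sum(w^2)                        # effective number of points
    Sr2   <- n0 / (n0 - 2) * sum(w * (y - a - b * x)^2) / Sw   # residual variance S_r^2
    list(a = a, b = b,
         sigma_a = sqrt(Sr2 * sum(w^2) / Sw * Swxx / Delta),   # corrected error of the intercept
         sigma_b = sqrt(Sr2 * sum(w^2) / Delta))               # corrected error of the slope
  }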


Page 6:


If you have individual error bars plus actual dispersion you are in trouble:

Example: velocity dispersions and ages (in Gyr) for a sample of 53 elliptical galaxies, one of the first pieces of evidence of downsizing (Sánchez-Blázquez et al. 2006, A&A, 457, 809).

[Figure: weighted and non-weighted fits to the data. Typical error = 0.037, residual error = 0.131.]

A possible approach is to run simulations: bootstrap + Monte Carlo.

This helps to provide a first order of magnitude for the solutions and to understand the distribution of the parameters, but it does not provide fair estimates of the parameters and their uncertainties.

[Figure: distribution of the slope obtained in the simulations for the classic weighted and the corrected weighted fits.]


Page 7:

Error bars in the X and Y variables

Homoscedasticity (constant errors) without actual (additional) dispersion. See Feigelson & Babu (1992, ApJ, 397, 55) and Fuller (1987, "Measurement Error Models").

Typical case in astronomy (Ex. relation between two magnitudes, or colors, with the same measurement errors)

Approximate solutions available

$$b \;=\; \frac{S_{yy} - h\,S_{xx} + \sqrt{\left(S_{yy} - h\,S_{xx}\right)^2 + 4\,h\,S_{xy}^2}}{2\,S_{xy}} , \qquad h = \frac{\sigma_y^2}{\sigma_x^2}$$

h = 0: OLS(X|Y);  h = 1: OR;  h = ∞: OLS(Y|X)
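A small R helper (a sketch based on the expression above; the function name is mine) makes the role of h explicit:

  # Slope of the homoscedastic errors-in-both-variables model,
  # given h = sigma_y^2 / sigma_x^2 (ratio of the error variances)
  slope_xy_errors <- function(x, y, h) {
    Sxx <- var(x); Syy <- var(y); Sxy <- cov(x, y)
    (Syy - h * Sxx + sqrt((Syy - h * Sxx)^2 + 4 * h * Sxy^2)) / (2 * Sxy)
  }
  # h = 1 reproduces orthogonal regression; a very large h tends to OLS(Y|X) = Sxy/Sxx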

Heteroscedasticity without actual (additional) dispersion

Some rough methods (not very good)

York (1966, Can. J. Phys., 44, 1079)

Numerical Recipes (Press et al. 2007, 3rd ed.)

Heteroscedasticity with actual (additional) dispersion

BCES estimator (bivariate correlated errors and intrinsic scatter), Akritas & Bershady (1996, ApJ, 470, 706); SIMEX algorithm (simulation-extrapolation), Carroll et al. (2006, "Measurement Error in Nonlinear Models")

This is the most usual situation, and only some unsatisfactory solutions have been proposed.


Page 8: The Markov Chain Monte-Carlo approach


AIM: To compute, by numerical methods, the posterior probability of a parameter or a model in the context of Bayesian statistics.

$$P(\theta\,|\,D) \;=\; \frac{P(D\,|\,\theta)\,P(\theta)}{\int_{\Omega_\theta} P(D\,|\,\theta')\,P(\theta')\,d\theta'}$$

posterior ∝ likelihood × prior

SOFTWARE

- Simultaneous determination of thousands of parameters
- Easy inclusion of prior information
- Easy implementation of many different probability distributions
- Discrimination among models using information criteria (e.g. WAIC)
- Very powerful hierarchical models
- Etc.

JAGS (Just Another Gibbs Sampler), used here from R (R + JAGS)
Stan (sampling through adaptive neighborhoods), a Hamiltonian Monte Carlo sampler, used here from Python (Python + Stan)
Also: emcee

The objective is to sample from the posterior using sampling algorithms that navigate the parameter space (e.g. the Metropolis algorithm). These deliver chains that reproduce the probability distribution function (pdf) of each parameter. It is not a minimization procedure.
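To make the sampling idea concrete, here is a bare-bones random-walk Metropolis sampler in R for the straight-line model with Gaussian scatter and flat priors (an illustrative sketch only; in practice JAGS or Stan do this far more efficiently):

  metropolis_line <- function(x, y, n_iter = 20000, step = c(0.1, 0.1, 0.05)) {
    logpost <- function(p) {                              # p = (a, b, sigma)
      if (p[3] <= 0) return(-Inf)                         # sigma must be positive
      sum(dnorm(y, p[1] + p[2] * x, p[3], log = TRUE))    # log-likelihood (flat priors)
    }
    chain <- matrix(NA, n_iter, 3, dimnames = list(NULL, c("a", "b", "sigma")))
    p  <- c(0, 0, sd(y))                                  # crude starting point
    lp <- logpost(p)
    for (i in 1:n_iter) {
      prop    <- p + rnorm(3, 0, step)                    # symmetric random-walk proposal
      lp_prop <- logpost(prop)
      if (log(runif(1)) < lp_prop - lp) { p <- prop; lp <- lp_prop }   # accept or reject
      chain[i, ] <- p                                     # the chain samples the posterior pdf
    }
    chain
  }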


Page 9:


model{
  for (i in 1:Ntotal) {
    zy[i] ~ dnorm(mu[i], 1/zsigma^2)
    mu[i] <- zbeta0 + zbeta1*zx[i]
  }
  zbeta0 ~ dnorm(0, 1/10^2)
  zbeta1 ~ dnorm(0, 1/10^2)
  zsigma ~ dunif(1.0E-3, 1.0E+3)
}
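The model above can be run from R with, for instance, the rjags package (a sketch of an assumed workflow: modelString would hold the model code as a string, and zx, zy are the standardized data):

  library(rjags)
  dataList <- list(zx = zx, zy = zy, Ntotal = length(zx))
  jm <- jags.model(textConnection(modelString), data = dataList, n.chains = 3)
  update(jm, 1000)                                                       # burn-in
  samples <- coda.samples(jm, c("zbeta0", "zbeta1", "zsigma"), n.iter = 10000)
  summary(samples)                                                       # posterior summaries of the 3 parameters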


No error bars, data with real dispersion. Example: the Faber-Jackson relation, with data from Schechter (1980).

3 parameters: slope, intercept and dispersion


The results are the full pdfs of the parameters, not just mean expected values with error bars that assume a particular distribution.

Page 10:


model{
  for (i in 1:Ntotal) {
    zobsy[i] ~ dnorm(zy[i], 1/zerry[i]^2)
    zy[i] ~ dnorm(mu[i], 1/zsigma^2)
    mu[i] <- zbeta0 + zbeta1*zx[i]
  }
  # priors on zbeta0, zbeta1 and zsigma omitted on the slide (as in the previous model)
}

Error bars in Y plus actual dispersion

Typical case in astrophysics and many other sciences. Example: the mass-age relation for elliptical galaxies.

No errors: β1 = 0.38 ± 0.09
Weighted (lm): β1 = 0.59 ± 0.08
Weighted (lm, erry): β1 = 0.59 ± 0.15
MCMC: β1 = 0.41 ± 0.10

In this case the intrinsic, actual dispersion (σ = 0.144 ± 0.023) dominates over the typical error (0.037). The classic method of weighted least squares overestimates the weight of the data with the smallest error bars. The MCMC approach also allows you to decide whether there is real dispersion.

3 parameters: slope, intercept and dispersion


[Figure: data with the unweighted, error-weighted and MCMC fits overplotted.]

Page 11:


Error bars in X and Y plus actual dispersion. Example: the relation between galaxy velocity dispersion and the mass of the central black hole.

It is necessary to specify a prior distribution for the data in X. It can be a uniform distribution or, for instance, a Gaussian with hyperpriors, etc.

3 parameters: slope, intercept and dispersion

Data from Tremaine (2002)

β0 = −1.10 ± 0.80
β1 = 4.01 ± 0.35
σ = 0.326 ± 0.056

model{
  for (i in 1:Ntotal) {
    zobsx[i] ~ dnorm(zx[i], 1/zerrx[i]^2)
    zobsy[i] ~ dnorm(zy[i], 1/zerry[i]^2)
    zx[i] ~ dunif(-1.E3, 1.E3)
    zy[i] ~ dnorm(mu[i], 1/zsigma^2)
    mu[i] <- zbeta0 + zbeta1*zx[i]
  }
  zbeta0 ~ dnorm(0, 1/10^2)
  zbeta1 ~ dnorm(0, 1/10^2)
  zsigma ~ dunif(1.0E-3, 1.0E+3)
}

Clear actual dispersion

Piece of cake with MCMC!


This is the most usual situation

Page 12:


[Figure: measured HδA versus HδA interpolated in the library, with the fitted relation. Posterior summaries: α: mean = 0.17, 95% HDI (−0.0598, 0.402), 7.3% of the posterior below 0; β: mean = 0.98, 95% HDI (0.912, 1.05), 72.1% below 1; σ: mean = 1.28, 95% HDI (1.13, 1.43).]

Comparison of the measured line-strength indices with those derived by interpolating in the previous library for the corresponding stellar parameters.

Linear regression with errors in X and Y plus real dispersion; MCMC is the only correct approach → posterior probability distributions for the intercept, slope and dispersion, with 95% Highest Density Intervals (HDI). Programmed in JAGS:

model{
  for (i in 1:Ntotal) {
    zobx[i] ~ dnorm(zx[i], 1/zerrx[i]^2)
    zoby[i] ~ dnorm(zy[i], 1/zerry[i]^2)
    zx[i] ~ dnorm(zmux, 1/zsigx^2) T(zxmin,zxmax)
    mu[i] <- zalpha + zbeta*zx[i]
    zy[i] ~ dnorm(mu[i], 1/zsigma^2)
  }
  zalpha ~ dnorm(0, 1/10^2)
  zbeta ~ dnorm(1, 1/10^2)
  zsigma ~ dunif(1E-5, 1.E+3)
  zmux ~ dnorm(0, 1)
  zsigx ~ dunif(0, 100)
}

Another example: Extension of the MILES library


Page 13:

More advantages: hierarchical models

[Figure 17.6 of Kruschke (2014): a model of dependencies for robust hierarchical linear regression (compare with Figure 17.2 therein). Copyright © Kruschke, J. K. (2014), Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan, 2nd Edition, Academic Press / Elsevier.]

For a sample with subgroups: fit a straight line to each subgroup but relate all the intercepts and slopes through a higher-order distribution (e.g. colour-magnitude relations for galaxies in different clusters).

[Figure 17.5 of Kruschke (2014): fictitious data for demonstrating hierarchical linear regression, with posterior predicted lines superimposed. Upper panel: all data together, with individuals represented by connected segments. Lower panels: plots of the individual units. Notice that the final two subjects have only single data points, yet the hierarchical model has fairly tight estimates of the individual slopes and intercepts. Copyright © Kruschke, J. K. (2014), Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan, 2nd Edition, Academic Press / Elsevier.]


Note that each data point informs not only the straight line of its own group but also all the other regression lines. There is an important "shrinkage" effect: the regression lines of each group are much better defined than in individual fits. The net effect is to increase the signal-to-noise ratio. You can even fit a straight line to a single point!
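A minimal JAGS sketch of such a hierarchical model (my own simplification: normal instead of t distributions, hypothetical variable names, with group[i] giving the subgroup of each data point):

  model{
    for (i in 1:Ntotal) {
      mu[i] <- beta0[group[i]] + beta1[group[i]]*x[i]
      y[i] ~ dnorm(mu[i], 1/sigma^2)
    }
    for (j in 1:Ngroups) {
      beta0[j] ~ dnorm(beta0mu, 1/beta0sd^2)     # group intercepts share a common distribution
      beta1[j] ~ dnorm(beta1mu, 1/beta1sd^2)     # group slopes share a common distribution
    }
    beta0mu ~ dnorm(0, 1/10^2)
    beta0sd ~ dunif(1.0E-3, 1.0E+3)
    beta1mu ~ dnorm(0, 1/10^2)
    beta1sd ~ dunif(1.0E-3, 1.0E+3)
    sigma ~ dunif(1.0E-3, 1.0E+3)
  }

The shrinkage comes from the shared hyperparameters (beta0mu, beta0sd, beta1mu, beta1sd), which pool information across groups.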


Page 14:


Hierarchical fits to derive age and metallicity gradients

Galaxies from CALIFA. The intercepts and slopes form a higher-order distribution; t-distributions are used instead of normal ones. Programmed in JAGS.

Ex. Stellar population gradients in the disks of spiral galaxies

[Figures: log(age) (LW) versus radius r (in disc scale lengths) for IC4566, comparing the weighted least-squares, non-hierarchical MCMC and hierarchical MCMC fits; posterior summaries β0: mean = −0.112, 95% HDI (−0.138, −0.0876), and β1: mean = −0.0303, 95% HDI (−0.0375, −0.0231); a comparison of the slope errors from the non-hierarchical (ebeta1ind) and hierarchical (ebeta1adj) fits; and [Z/H] (LW) versus r for NGC5378.]

Page 15:


If you have different kinds of objects (e.g. galaxy types) you can compare the higher-order distributions for each kind.

Ex. Stellar population gradients in the disks of spiral galaxies


[Figures: LW metallicity and age gradients (β1) for the E−S0, S0a−Sab, Sb, Sbc and Sc−Sdm groups, with the posteriors of the higher-order means beta0mu[1..5] (means −0.0145, −0.0378, −0.0914, −0.158, −0.318) and beta1mu[1..5] (means −0.0185, −0.0336, −0.032, −0.0242, −0.00355), and the posterior of the difference Δ(β1) between Sb and Sc−Sdm: for the LW metallicity gradients, mean = −0.0315, 95% HDI (−0.0645, 0.000795), 97.2% of the posterior below 0; for the LW age gradients, mean = −0.0599, 95% HDI (−0.101, −0.0161), 99.7% below 0.]

Note that the comparison is made not between the final mean values but using all the steps of the MCMC chains → the full probability distribution of the differences among types.

Page 16:

And more advantages: different error models

You don't need to assume Gaussian errors (in X or Y); you can work with different distributions.

A t-distribution for the errors allows you to carry out a robust regression, in which outliers have a much lower effect on the derived parameters. The number of degrees of freedom of the t-distribution is then an additional parameter to derive.
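A minimal sketch of such a robust model in JAGS (following the recipe popularized by Kruschke 2014; variable names follow the earlier examples, and the prior on the degrees of freedom is one common choice):

  model{
    for (i in 1:Ntotal) {
      mu[i] <- zbeta0 + zbeta1*zx[i]
      zy[i] ~ dt(mu[i], 1/zsigma^2, nu)        # Student-t likelihood: heavy tails absorb outliers
    }
    zbeta0 ~ dnorm(0, 1/10^2)
    zbeta1 ~ dnorm(0, 1/10^2)
    zsigma ~ dunif(1.0E-3, 1.0E+3)
    nuMinusOne ~ dexp(1/29.0)                  # prior on the degrees of freedom
    nu <- nuMinusOne + 1                       # keeps nu >= 1
  }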

Towards generalized linear models:
- Log-normal models: y > 0, Var(y) ∝ E(y)²
- Beta models: 0 < y < 1
- Gamma models
- Inverse Gaussian models

[Figures: example fits with log-normal and beta models.]

Plus logistic regression, binomial regression, beta-binomial regression, Poissonian models, negative binomial models, etc.


Page 17:

Simple linear regression fits: comparison of methods

See Andreon & Hurn (2013, Statistical Analysis and Data Mining, 6, 15) for a description of the different methods: ordinary least squares, weighted least squares, symmetrical fits (bisector, orthogonal regression, RMA, etc.), maximum-likelihood estimation (MLE), robust regression, BCES (Bivariate Correlated Errors and intrinsic Scatter), survival analysis, Bayesian methods, etc.

There are many potential complications when performing simple regression fits: heteroscedasticity in the errors, intrinsic (actual) dispersion, selection effects, complex data structures, non-uniform populations (Malmquist bias), non-Gaussianity, outliers, etc. First question: why are you making the fit? Parameter estimation, prediction, or model selection? The procedures and difficulties are different in each case.

Comparative analysis of some methods (ordinary least squares OLS Y|X, bisector OLS, and a Bayesian procedure with MCMC) in their usefulness to: 1) estimate the straight-line parameters, and 2) make predictions (for y, having observed x). (See Andreon & Weaver for results on BCES and MLE.)

Simulated data: values of x and y following the relation:

y = x

with x following a uniform distribution in the (−3, 3) interval and with known measurement errors (σx = σy = 1) in x and in y. Without actual dispersion.


Page 18:

model{
  x ~ dunif(-3,3)
  y <- x + 0
  obsx ~ dnorm(x,1)
  obsy ~ dnorm(y,1)
}

We generate 10000 data pairs, of which 100 random pairs are used to make the fits and the rest to check the predictions (the data are created using JAGS).
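The same data set could equally be generated directly in R (an equivalent sketch; the seed is arbitrary, and the second argument of dnorm in the JAGS code is a precision of 1, i.e. a standard deviation of 1):

  set.seed(42)
  N    <- 10000
  x    <- runif(N, -3, 3)        # true x, uniform in (-3, 3)
  y    <- x                      # true relation: y = x
  obsx <- rnorm(N, x, 1)         # observed x, unit measurement error
  obsy <- rnorm(N, y, 1)         # observed y, unit measurement error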

[Figures: obsy versus x and y versus obsx; actual values in red, observed values in black.]


Page 19:

Note that, for a small range of values around a given obsx, the observed values of y (obsy) are not symmetrically distributed around obsy = obsx: there is an important bias toward lower obsy values.

Example: <obsx> = 3.0, <obsy> = 2.2

For a high value of the observed x it is more likely to obtain an observed y below the actual one, and the other way around (a kind of Malmquist bias).

This bias occurs for any value of obsx and is more important at the extremes of the distribution.

Repeating this for all values of obsx:

<obsy> for each obsx.
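This can be checked numerically with the simulated data (an illustrative sketch using the obsx, obsy vectors generated above):

  bins <- cut(obsx, breaks = seq(-6, 6, by = 0.5))            # bin the observed x values
  plot(tapply(obsx, bins, mean), tapply(obsy, bins, mean),
       xlab = "<obsx>", ylab = "<obsy>")
  abline(0, 1, col = "red")                                   # the 1:1 line
  # <obsy> falls below the 1:1 line at high obsx and above it at low obsx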


Page 20:

Linear regression:

model{
  for (i in 1:length(obx)) {
    x[i] ~ dunif(-3,3)
    y[i] <- alpha + beta*x[i]
    obx[i] ~ dnorm(x[i],1)
    oby[i] ~ dnorm(y[i],1)
  }
  alpha ~ dnorm(0.,1.E-4)
  beta ~ dt(0,1,1)
}

> jags.params <- c("alpha","beta","oby")
> obx <- obsx
> oby <- rep(NA,length(obsy))
> for (i in 1:100){
+   oby[i*100] <- obsy[i*100]
+ }
> dataList <- list(obx = obx, oby=oby)

Bayesian fit using MCMC

Results are compatible with the theoretical relation (y = 0 + 1·x).

We fit a straight line to a subsample of 100 random points. For the rest we set the observed y values obsy to NA (Not Available) and we make predictions from their observed obsx values. A fit with ~9902 parameters!

Example: predicted distribution for the y value corresponding to obsx(5347).

y = (0.20 ± 0.13) + (1.09 ± 0.07) x


Page 21:


Ordinary least-squares fit (OLS Y|X)

y = (0.24 ± 0.12) + (0.84 ± 0.06) x

Bisector OLS fit

y = (0.26 ± 0.42) + (1.04 ± 0.06) x

Bayesian fit (MCMC): y = (0.20 ± 0.13) + (1.09 ± 0.07) x

OLS Y|X underestimates the slope of the theoretical straight line (y = 0 + 1·x), whilst the bisector OLS and the Bayesian fits recover the input value.



Page 22:

Prediction

[Figures: residuals (observed − predicted) versus the observed x values for the OLS Y|X, bisector OLS and MCMC fits.]

We check the predictive power of the three fits using the 9900 (10000 − 100) points not used to derive the regression parameters. For OLS Y|X and bisector OLS we use the fitted straight lines directly; MCMC computes the predicted values as additional parameters of the model, without using the fitted straight line. We plot the residuals (observed − predicted) versus the observed x values.

The bisector OLS method provides quite biased results (the same holds for all symmetrical methods); the classic OLS Y|X works better but is also biased. The Bayesian MCMC method does not introduce any significant bias. Non-Bayesian fitting methods are not valid for making predictions here: the relation between y and obsx is not linear, so applying a linear regression to make predictions can never work. This is avoided in MCMC, which derives the linear regression between the real values and does not use that derived regression to make predictions.


Page 23:


Conclusions:

old-fashioned ordinary least-squares stink

Use MCMC!

Page 24:

References

Doing Bayesian Data Analysis, A Tutorial with R, JAGS and Stan, 2nd edition, John K. Kruschke, 2015, Elsevier

Bayesian Methods for the Physical Sciences, Learning from Examples in Astronomy and Physics, Stefano Andreon & Brian Weaver, 2015, Springer Series in Astrostatistics

The BUGS Book, A Practical Introduction to Bayesian Analysis, David Lunn et al., 2013, Texts in Statistical Science, CRC Press


Bayesian Models for Astrophysical Data using R, JAGS, Python and Stan, Joseph M. Hilbe, Rafael S. de Souza and Emille E.O. Ishida, 2017, Cambridge University Press

Statistical Rethinking, Richard McElreath, 2016, Texts in Statistical Science, CRC Press

Bayesian Modeling Using WinBUGS, Ioannis Ntzoufras, 2009, Wiley Series in Computational Statistics, Wiley

Bayesian Data Analysis, 3rd edition, Andrew Gelman et al., 2014, Texts in Statistical Science, CRC Press

Bayesian Computation with R, 2nd edition, Jim Albert, 2009, Use R!, Springer

See many examples in the OpenBUGS User Manual (http://www.openbugs.net/Manuals/Manual.html)
