Lecture 5: Missing values and Spatial Probit models

James P. LeSage

University of Toledo

Department of Economics

Toledo, OH 43606

October 24, 2005

Motivation

• County auditors need to produce estimates of market value for

unsold homes for real estate tax assessment.

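• Using a sample $y_s$ of sales prices and $X_s$ of home characteristics, a hedonic price regression is typically used to produce $\hat{\beta}$ based on the model $y_s = X_s\beta + \varepsilon_s$, e.g.,

$$
\begin{pmatrix} y_s \\ y_u \end{pmatrix}
= \begin{pmatrix} X_s \\ X_u \end{pmatrix}\beta
+ \begin{pmatrix} \varepsilon_s \\ \varepsilon_u \end{pmatrix}
\qquad (1)
$$

• Assessments are made using: $\hat{y}_u = X_u\hat{\beta}$.

• $\hat{\beta} = (X_s'X_s)^{-1}X_s'y_s$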

• In this “missing completely at random” (MCAR) setting, information on $X_u$ cannot be used to improve on the estimate $\hat{\beta}$; see Little (1992), Little and Rubin (1987), and Rao and Toutenburg (1995).

• Intuitively, under independence any subset of the data should

work equally well to estimate the parameters of the model.

Knowledge of Xu becomes redundant information.

Spatial regression models

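• We note that in this setting more appropriate models are:

$$
\begin{aligned}
\text{SAR:}\quad & y = \rho W y + X\beta + \varepsilon && (2) \\
\text{SDM:}\quad & y = \rho W y + X\beta + W X_o \gamma + \varepsilon && (3) \\
\text{SEM:}\quad & y = X\beta + \nu && (4) \\
& \nu = \rho W \nu + \varepsilon \\
& \varepsilon \sim N(0, \sigma^2 I_n)
\end{aligned}
$$

• Where: $W$ is an $n \times n$ non-negative spatial weight matrix with zeros on the diagonal, $W_{ij} > 0$ for observations $j = 1, \ldots, n$ sufficiently close to observation $i$, and $\sum_{j=1}^{n} W_{ij} = 1$.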

• The scalar parameter ρ reflects the magnitude of spatial

dependence.

• The parameters to be estimated in the SAR and SEM models are $\beta$, $\sigma^2_\varepsilon$ and $\rho$, with the additional parameter vector $\gamma$ added to these for the SDM model.
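As a rough illustration of the definitions above, the following Python sketch builds a row-standardized weight matrix from nearest neighbours and simulates data from the SAR model (2). The helper names `knn_weight_matrix` and `simulate_sar`, the choice of five neighbours, and all parameter values are assumptions made for this example and do not come from the lecture.

```python
import numpy as np

def knn_weight_matrix(coords, k):
    """Row-standardized k-nearest-neighbour weight matrix with a zero diagonal."""
    n = coords.shape[0]
    # Pairwise squared Euclidean distances between all locations.
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)           # an observation is never its own neighbour
    nearest = np.argsort(d2, axis=1)[:, :k]
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    W[rows, nearest.ravel()] = 1.0 / k     # equal weights, so each row sums to one
    return W

def simulate_sar(W, X, beta, rho, sigma, rng):
    """Draw y from y = rho*W*y + X*beta + eps by solving (I - rho*W) y = X*beta + eps."""
    n = W.shape[0]
    eps = rng.normal(scale=sigma, size=n)
    return np.linalg.solve(np.eye(n) - rho * W, X @ beta + eps)

rng = np.random.default_rng(0)
coords = rng.uniform(size=(200, 2))        # hypothetical house locations
W = knn_weight_matrix(coords, k=5)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = simulate_sar(W, X, beta=np.array([1.0, 1.0]), rho=0.7, sigma=1.0, rng=rng)
```

A dense matrix is used only to keep the sketch short; for realistic sample sizes a sparse representation of $W$ would be the natural choice.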

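• We can partition the SEM model into sold and unsold sub-samples,

$$
\begin{pmatrix} y_s \\ y_u \end{pmatrix}
= \begin{pmatrix} X_s \\ X_u \end{pmatrix}\beta
+ \begin{pmatrix} \nu_s \\ \nu_u \end{pmatrix}
\qquad (5)
$$

$$
\begin{pmatrix} \nu_s \\ \nu_u \end{pmatrix}
= \rho \begin{pmatrix} W_{ss} & W_{su} \\ W_{us} & W_{uu} \end{pmatrix}
\begin{pmatrix} \nu_s \\ \nu_u \end{pmatrix}
+ \begin{pmatrix} \varepsilon_s \\ \varepsilon_u \end{pmatrix}
$$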

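• The log-likelihood for this model conditions the observed sample $y_s$ on the missing sample observations $y_u$, as shown in (6).

$$
\begin{aligned}
L(\beta, \sigma^2, \rho \mid y_s) &= \int f(y \mid \beta, \sigma^2, \rho)\, dy_u && (6) \\
f(y \mid \beta, \sigma^2, \rho) &= (2\pi\sigma^2)^{-n/2}\, |\Gamma|^{-1/2} \exp\!\left[-\frac{1}{2\sigma^2}\, e'\Gamma^{-1}e\right] \\
\Gamma &= \sigma^{-2}\, \mathrm{var}(\nu)
\end{aligned}
$$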

• Where: e = y − Xβ which can be partitioned as: es =

ys − Xsβ and eu = yu − Xuβ.

• Application of the partitioning scheme to the term $e'\Gamma^{-1}e$ from this likelihood results in the following expression:

$$
e'\Gamma^{-1}e = e_s'\Gamma^{ss}e_s + e_u'\Gamma^{us}e_s + e_s'\Gamma^{su}e_u + e_u'\Gamma^{uu}e_u \qquad (7)
$$

• Where $\Gamma^{ss}$, $\Gamma^{us}$, $\Gamma^{su}$ and $\Gamma^{uu}$ represent partitioned matrices from $\Gamma^{-1}$.

• The expressions eu = yu − Xuβ in (7) should make it clear

that information regarding the characteristics of unsold homes

or missing values represented by Xu enters into the likelihood.

• In addition, information regarding the spatial covariance or connectivity between sold and unsold homes captured by $\Gamma^{us}$, $\Gamma^{su}$ contributes to the likelihood.

• Of course, estimates based on maximizing the likelihood in

(6) are not possible because sample observations on yu are

not available. However, we can replace these missing values

with their expected values conditional on the observed sample

information in the likelihood, as motivated by the integral.

Assuming normally distributed disturbances ε in this model,

the expectation of yu|ys is:

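$$
E(y_u \mid y_s) = X_u\beta + \Gamma_{us}\Gamma_{ss}^{-1}(y_s - X_s\beta) \qquad (8)
$$

(Here $\Gamma_{us}$ and $\Gamma_{ss}$ denote blocks of $\Gamma$ itself, not of $\Gamma^{-1}$.)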

• Since this conditional expectation consists of parameters and

observed sample information, we can replace the missing values

yu in the likelihood function. This would allow maximizing

the likelihood to produce estimates for all parameters in the

model.
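A small numerical sketch may help make the conditional expectation in (8) concrete. It assumes a toy ring-shaped $W$, known parameter values, and sold observations ordered first; none of these choices come from the lecture. The prediction for the unsold block is computed once from blocks of the covariance matrix and once from blocks of its inverse, and the two should agree.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_s = 12, 8                        # 8 "sold" and 4 "unsold" observations (toy sizes)
rho, sigma2 = 0.6, 1.0                # assumed known here; in practice these are estimated
beta = np.array([1.0, 2.0])

# Toy row-standardized W: the neighbours of each observation are the two adjacent ones on a ring.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

X = np.column_stack([np.ones(n), rng.normal(size=n)])
A = np.eye(n) - rho * W
cov = sigma2 * np.linalg.inv(A.T @ A)           # var(nu) for the SEM disturbances
y = X @ beta + np.linalg.solve(A, rng.normal(scale=np.sqrt(sigma2), size=n))

s, u = slice(0, n_s), slice(n_s, n)             # sold block first, unsold block second
resid_s = y[s] - X[s] @ beta

# Conditional expectation using covariance blocks.
pred_cov = X[u] @ beta + cov[u, s] @ np.linalg.solve(cov[s, s], resid_s)

# The same quantity from blocks of the precision matrix (inverse covariance).
prec = (A.T @ A) / sigma2
pred_prec = X[u] @ beta - np.linalg.solve(prec[u, u], prec[u, s] @ resid_s)
```

Note that `X[u]` enters the prediction directly, and the sold residuals enter through the cross blocks, which is how the characteristics and locations of unsold homes contribute.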

An important point to note:

• The “full” data likelihood is:

$$
f(y_s, y_u \mid X_s, X_u, \beta, \sigma^2, \rho) = f(y_u \mid y_s, X_s, X_u, \beta, \sigma^2, \rho) \cdot f(y_s \mid X_s, X_u, \beta, \sigma^2, \rho)
$$

• Where the second term on the right is the “incomplete” or marginal data likelihood, which is available explicitly. We can integrate out $y_u$ and simply maximize $f(y_s \mid X_s, X_u, \beta, \sigma^2, \rho)$ over the parameters, which gives the incomplete-data MLEs. If prediction of $y_u$ is of interest, $E(y_u \mid y_s, X_s, X_u, \beta, \sigma^2, \rho)$ is a parametric function of the model parameters. Plugging in the MLEs yields the MLE of this predictive mean.

• Improved spatial prediction has nothing to do with imputing

unobserved responses yu.

• It does however have to do with utilizing Xu in addition to

Xs.

The SAR model:

• The same approach can be taken for the partitioned SAR model (with the SDM model a trivial extension).

• A difference is that the expression for E(yu|ys) in this model

takes the form shown in (9) based on standard multivariate

normal distribution theory (Poirier, 1995, page 122).

$$
\begin{aligned}
E(y_u \mid y_s) &= \mu_u + \Gamma_{us}\Gamma_{ss}^{-1}(y_s - \mu_s) && (9) \\
\mu_u &= B^{21}X_s\beta + B^{22}X_u\beta \\
\mu_s &= B^{11}X_s\beta + B^{12}X_u\beta \\
B &= \begin{pmatrix} I_s - \rho W_{ss} & -\rho W_{su} \\ -\rho W_{us} & I_u - \rho W_{uu} \end{pmatrix} && (10)
\end{aligned}
$$

• In (9), $B^{ij}$ denotes the $i, j$ partition of the inverse of the matrix $B$, which is a function of the parameter $\rho$ alone. As in the case of the SEM model, we can maximize the log-likelihood after replacing the missing values $y_u$ with their expected values conditional on the observed sample information, $y_s$. From a computational standpoint this requires that we calculate the matrix inverse, $B^{-1}$, which is an $n \times n$ matrix.

Implementation details

• We first consider the SEM model, where things are

computationally simpler than for the SAR model. We require

an efficient way to compute:

$$
E(y_u \mid y_s) = X_u\beta + \Gamma_{us}\Gamma_{ss}^{-1}(y_s - X_s\beta) \qquad (11)
$$

• Straightforward implementation of (11) involves the inverse matrix $\sigma^2[(I_n - \rho W)'(I_n - \rho W)]^{-1}$.

• The spatial weight matrix $W$ is sparse, as is $(I_n - \rho W)$. Unfortunately, matrix inversion results in “fill-in” of the sparse matrix product $(I_n - \rho W)'(I_n - \rho W)$, adding to memory storage requirements.

• As an example, the number of non-zero elements in $(I_n - \rho W)'(I_n - \rho W)$ for a sample of 4,048 housing observations from Lucas County, Ohio was 30,748, whereas the number in $[(I_n - \rho W)'(I_n - \rho W)]^{-1}$ was 1,639,838, requiring an increase in memory of over 50 times.
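The fill-in effect can be reproduced on a small synthetic example. The sketch below assumes a hypothetical ring-shaped $W$ with two neighbours per observation (not the Lucas County data) and counts the non-zero entries of the sparse product and of its inverse; the cut-off of 1e-12 for treating an entry of the inverse as zero is an arbitrary choice made for this illustration.

```python
import numpy as np
import scipy.sparse as sp

n, rho = 400, 0.7
# Hypothetical sparse W: two neighbours per observation on a ring, rows sum to one.
idx = np.arange(n)
rows = np.concatenate([idx, idx])
cols = np.concatenate([(idx - 1) % n, (idx + 1) % n])
W = sp.csr_matrix((np.full(2 * n, 0.5), (rows, cols)), shape=(n, n))

A = sp.identity(n, format="csr") - rho * W
AtA = (A.T @ A).tocsr()                       # sparse product (I - rho W)'(I - rho W)
nnz_product = AtA.nnz

inv_dense = np.linalg.inv(AtA.toarray())      # the inverse has to be stored densely
nnz_inverse = int(np.count_nonzero(np.abs(inv_dense) > 1e-12))

print(nnz_product, nnz_inverse, nnz_inverse / nnz_product)
```

The exact ratio depends on $W$, $\rho$ and the cut-off, so the printed numbers should not be compared directly with the figures quoted for the housing sample.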

• Gelman, Carlin, Stern and Rubin (1995, p. 479) provide the key to efficient computation. Let $\Gamma^{-1} = (1/\sigma^2)(I - \rho W)'(I - \rho W)$, and form the matrix $C = I - [\mathrm{diag}(\Gamma^{-1})]^{-1}\Gamma^{-1}$, such that the distribution of the $i$th element of $y$, conditional on all other elements $j \neq i$, takes the form:

$$
(y_i \mid y_j, j \neq i) \sim N\Big[\mu_i + \sum_{j \neq i} c_{ij}(y_j - \mu_j),\ 1/\Gamma^{ii}\Big] \qquad (12)
$$

• Adopting this in a way that is ideally suited to our application, based on the partitioning of observations into observed and missing values, results in:

$$
\begin{aligned}
C &= I_n - [\mathrm{diag}(\Gamma^{-1})]^{-1}\Gamma^{-1} && (13) \\
E(y_u \mid y_s) &= \mu_u + C_{us}(y_s - \mu_s) \\
\mathrm{var}(y_u \mid y_s) &= \iota \oslash \mathrm{diag}(\Gamma^{-1})_{uu}
\end{aligned}
$$

• Where $\iota$ denotes a vector of ones, and $\iota \oslash \mathrm{diag}(\Gamma^{-1})_{uu}$ indicates dividing $\iota$ elementwise by the vector $\mathrm{diag}(\Gamma^{-1})_{uu}$, reflecting the main diagonal elements of the $uu$ partition from the inverse of $\Gamma$. The expression $C_{us}$ refers to the $us$ partition of $C$.
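The single-element result in (12) can be checked numerically. The following sketch, which assumes a toy ring-shaped $W$ and arbitrary parameter values, forms $C$ from $\Gamma^{-1}$ alone and compares the implied conditional mean and variance of one element with the textbook formulas based on the covariance matrix. The covariance matrix is formed here only to carry out the comparison.

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho, sigma2 = 10, 0.5, 2.0
W = np.zeros((n, n))
for i in range(n):                       # toy ring neighbours, rows sum to one
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5

A = np.eye(n) - rho * W
prec = (A.T @ A) / sigma2                # Gamma^{-1}, available without any inversion
cov = np.linalg.inv(prec)                # Gamma, needed only for the check below
mu = rng.normal(size=n)
y = mu + np.linalg.solve(A, rng.normal(scale=np.sqrt(sigma2), size=n))

d = np.diag(prec)
C = np.eye(n) - prec / d[:, None]        # C = I - [diag(Gamma^{-1})]^{-1} Gamma^{-1}

i = 3
others = np.delete(np.arange(n), i)
mean_12 = mu[i] + C[i, others] @ (y[others] - mu[others])   # conditional mean from C
var_12 = 1.0 / d[i]                                         # conditional variance from diag

# Textbook conditioning on the covariance matrix, for comparison.
S_io = cov[i, others]
S_oo = cov[np.ix_(others, others)]
mean_direct = mu[i] + S_io @ np.linalg.solve(S_oo, y[others] - mu[others])
var_direct = cov[i, i] - S_io @ np.linalg.solve(S_oo, S_io)
```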

• The key point to note here is that the expression $\Gamma^{-1} = (1/\sigma^2_\varepsilon)(I_n - \rho W)'(I_n - \rho W)$ that appears in (13) allows us to avoid computing the inverse.

• In addition, two of the matrices are diagonal matrices, $[\mathrm{diag}(\Gamma^{-1})]^{-1}$ and $\iota \oslash \mathrm{diag}(\Gamma^{-1})_{uu}$, resulting in a tremendous savings in memory, as well as speedy computation of the matrix $C$.

• A related point is that computing $\mathrm{diag}(\Gamma^{-1})$ involves simply computing the sum of the squared column elements of the matrix $W$, multiplying these by $\rho^2$, and adding a vector of ones, e.g.,

$$
\begin{aligned}
\mathrm{diag}[(I_n - \rho W)'(I_n - \rho W)] &= \mathrm{diag}(I_n + \rho^2 W'W - \rho W' - \rho W) \\
&= \mathrm{diag}(I_n + \rho^2 W'W)
\end{aligned}
$$

(This occurs because the spatial weight matrix has zeros on the main diagonal.)
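The diagonal shortcut is easy to verify with a sparse matrix. The sketch below assumes a hypothetical $W$ with three randomly chosen neighbours per row and compares the column-sum formula with the diagonal of the explicit product. The factor $1/\sigma^2$ in $\Gamma^{-1}$ is left out, since it only rescales the result.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(3)
n, rho = 500, 0.6
# Hypothetical sparse W with three random neighbours per row (never the row itself),
# row-standardized so that each row sums to one.
rows = np.repeat(np.arange(n), 3)
offsets = np.concatenate(
    [rng.choice(np.arange(1, n), size=3, replace=False) for _ in range(n)]
)
cols = (rows + offsets) % n
W = sp.csr_matrix((np.full(3 * n, 1.0 / 3.0), (rows, cols)), shape=(n, n))

# diag[(I - rho W)'(I - rho W)] = 1 + rho^2 * (column sums of squared W entries)
col_sq_sums = np.asarray(W.multiply(W).sum(axis=0)).ravel()
diag_fast = 1.0 + rho**2 * col_sq_sums

# Reference value from the explicit product, feasible here because n is small.
A = np.eye(n) - rho * W.toarray()
diag_ref = np.diag(A.T @ A)
```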

Alternative approaches to estimation

• Approach #1: Maximization of the log-likelihood for this problem involves a constrained optimization problem over the parameters $\beta$, $\sigma$ and $\rho$, with $\rho$ constrained to the interval $1/\lambda_{\min} < \rho < 1/\lambda_{\max}$, where $\lambda_{\min}$, $\lambda_{\max}$ denote the minimum and maximum eigenvalues of the spatial weight matrix $W$.

• Approach #2: An alternative is to construct a “repaired” data vector $y' = [y_s \;\; E(y_u \mid y_s)]'$ and use existing efficient algorithms for maximum likelihood estimation of the SAR, SEM and SDM models in an iterative scheme, similar to an E-M algorithm.

• Approach #3: Another alternative is to treat $E(y_u \mid y_s)$ as latent variables that become part of the estimation problem in an MCMC scheme.

Advantages of Approach #3

1. Bayesian prior information can be introduced for the

parameters β, σ, ρ based on perhaps the observed sample

ys, Xs. This could reduce the dispersion of estimates and

prediction outcomes.

2. One need not assume a multivariate normal distribution for $y$; a multivariate form of the $t$-distribution could be used. As in the case of the multivariate normal distribution, conditional distributions from the multivariate $t$ take a multivariate $t$ form. Assuming $y \sim t_n(\mu, \Sigma, \delta)$ will lead to a conditional mean $E(y_u \mid y_s) = \mu_u + \Sigma_{us}\Sigma_{ss}^{-1}(y_s - \mu_s)$ (see Theorem 3.4.10 in Poirier (1995)).

3. Posterior measures of dispersion for the estimates and predictions are easily obtainable from the MCMC draws. This is in contrast to maximum likelihood estimation, where numerical Hessians are required to compute measures of dispersion.

Specifics of approach #2

• We have already established that it is easy to construct $E(y_u \mid y_s)$, which can be used to produce a repaired vector $y$. This can be used in conjunction with the log-likelihood function concentrated with respect to $\beta$, $\sigma$:

$$
\begin{aligned}
\ln L(\rho) &= F + \ln|I_n - \rho W| - (n/2)\ln(e'e) \\
e &= e_o - \rho e_d \\
e_o &= y - X\beta_o \\
e_d &= Wy - X\beta_d \\
\beta_o &= (X'X)^{-1}X'y \\
\beta_d &= (X'X)^{-1}X'Wy && (14)
\end{aligned}
$$

Where F represents a constant not involving the parameters.

Pace and Barry (1997) demonstrate that direct sparse matrix

Cholesky algorithms can be used to compute the log-

determinant over a grid of values for the parameter ρ restricted

to the interval (−1, 1), and Barry and Pace (1999) provide

an approximation approach for doing this.

• Given this grid, a vector evaluation of the SAR log-likelihood

function over this grid of log-determinant values can be used

to find maximum likelihood estimates.

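• Specifically, we might use a grid of $q$ values for $\rho$ in the interval $(1/\lambda_{\min}, 1/\lambda_{\max})$:

$$
\begin{pmatrix} \ln L(\rho_1) \\ \ln L(\rho_2) \\ \vdots \\ \ln L(\rho_q) \end{pmatrix}
\propto
\begin{pmatrix} \ln|I_n - \rho_1 W| \\ \ln|I_n - \rho_2 W| \\ \vdots \\ \ln|I_n - \rho_q W| \end{pmatrix}
- (n/2)
\begin{pmatrix} \ln(\phi(\rho_1)) \\ \ln(\phi(\rho_2)) \\ \vdots \\ \ln(\phi(\rho_q)) \end{pmatrix}
\qquad (15)
$$

where $\phi(\rho_i) = e_o'e_o - 2\rho_i e_d'e_o + \rho_i^2 e_d'e_d$.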

• Having solved for the optimal $\rho$, conditional on the repaired sample involving $y_u$, the parameters $\beta = \beta_o - \rho\beta_d$ and $\sigma^2 = (e_o - \rho e_d)'(e_o - \rho e_d)/(n - k)$ can be used to produce a new $y_u$ and another repaired sample for $y$. This forms the basis for another iteration of the (E-M like) algorithm, with the process continued until convergence.
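The grid evaluation of the concentrated SAR log-likelihood can be sketched in a few lines of Python. This is a minimal illustration on simulated complete data, not the lecture's implementation: the function name `sar_grid_mle`, the toy ring-shaped $W$, and the dense log-determinant via `numpy.linalg.slogdet` (in place of the sparse Cholesky or approximation methods cited above) are all assumptions made for the example. In the E-M like scheme, `y` would be the repaired vector at the current iteration.

```python
import numpy as np

def sar_grid_mle(y, X, W, rho_grid):
    """Maximize the concentrated SAR log-likelihood over a grid of rho values."""
    n, k = X.shape
    Wy = W @ y
    XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)
    beta_o, beta_d = XtX_inv_Xt @ y, XtX_inv_Xt @ Wy
    e_o, e_d = y - X @ beta_o, Wy - X @ beta_d

    # Log-determinant ln|I - rho W| at every grid point (dense, for small n only).
    eye = np.eye(n)
    logdet = np.array([np.linalg.slogdet(eye - r * W)[1] for r in rho_grid])

    # phi(rho) = e_o'e_o - 2 rho e_d'e_o + rho^2 e_d'e_d, for the whole grid at once.
    phi = e_o @ e_o - 2.0 * rho_grid * (e_d @ e_o) + rho_grid**2 * (e_d @ e_d)
    loglik = logdet - (n / 2.0) * np.log(phi)       # the constant F is omitted

    rho_hat = rho_grid[np.argmax(loglik)]
    beta_hat = beta_o - rho_hat * beta_d
    resid = e_o - rho_hat * e_d
    sigma2_hat = resid @ resid / (n - k)
    return rho_hat, beta_hat, sigma2_hat

# Simulated example with a toy ring-shaped W (two neighbours, weights 0.5).
rng = np.random.default_rng(4)
n, rho_true = 400, 0.6
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = np.linalg.solve(np.eye(n) - rho_true * W,
                    X @ np.array([1.0, 1.0]) + rng.normal(size=n))

rho_grid = np.linspace(-0.95, 0.95, 191)
rho_hat, beta_hat, sigma2_hat = sar_grid_mle(y, X, W, rho_grid)
```

Because the log-determinants depend only on $W$ and the grid, they could be computed once and reused across iterations, which is where most of the saving comes from.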

Specifics of approach #3

• A formal statement of the Bayesian SAR model is shown in

(16), where we have added a normal-gamma conjugate prior

for β and σ, and a uniform prior for ρ. The prior distributions

are indicated using π.

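$$
\begin{aligned}
y &= \rho W y + X\beta + \varepsilon && (16) \\
\varepsilon &\sim N(0, \sigma^2 I_n) \\
\pi(\beta) &\sim N(c, T) \\
\pi(1/\sigma^2) &\sim \Gamma(d, \nu) \\
\pi(\rho) &\sim U[0, 1]
\end{aligned}
$$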

• To implement this estimation method, we need to determine

the conditional distributions for each parameter in our Bayesian

SAR model as well as yu, which will be treated as a latent

variable and could be viewed as additional parameters to be

estimated.

MCMC sampler

• Begin with arbitrary $\beta^0$, $\sigma^0$, $\rho^0$ and sample sequentially from:

1. $p(y_u \mid y_s, \beta^0, \sigma^0, \rho^0)$, shown in the paper, using the computationally efficient approach to calculation described. This updated value for the parameter vector $y_u$ we label $y_u^1$.

2. $p(\beta \mid \sigma^0, \rho^0, y_u^1)$, which is a multivariate normal distribution with mean and variance defined in the paper. Note that we rely on the updated value of the parameter vector $y_u^1$ when evaluating this conditional density.

3. $p(\sigma \mid \beta^1, \rho^0, y_u^1)$, which is chi-squared distributed with $n + 2d$ degrees of freedom as shown in the paper.

4. $p(\rho \mid \beta^1, \sigma^1, y_u^1)$, which we sample using numerical integration and inversion. Note also that it is easy to implement a normal or some alternative prior distribution for this parameter.

• We now return to step 1, employing the updated parameter values in place of the initial values $y_u^0$, $\beta^0$, $\sigma^0$, $\rho^0$. On each pass through the sequence we collect the parameter draws, which are used to construct a joint posterior distribution for the parameters in our model.

Monte Carlo experiment

Results are somewhat dependent on the spatial configuration or location of the missing versus non-missing observations.

• One way to compare accuracy is to consider the gap between OLS prediction accuracy and the best possible prediction accuracy associated with a benchmark model, SAR-ALL, based on using all of the data.

– OLS: mean(|e_OLS|) = 0.30

– SAR-ALL: mean(|e_ALL|) = 0.15

– gap = mean(|e_OLS|) − mean(|e_ALL|) = 0.15

– SAR-MISS: mean(|e_MISS|) = 0.20

– The SAR-MISS model closed 0.10/0.15 ≈ 67% of the gap.

• For the high spatial connectivity sample, the SAR-MISS model closed 97 percent of the gap.

• For the low spatial connectivity sample, the SAR-MISS model closed 89.5 percent of the gap.

• Figures showing distribution of estimates for β and ρ.
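The gap measure used above reduces to a one-line calculation. The function name `share_of_gap_closed` is a hypothetical helper introduced only for this illustration, and the inputs are the illustrative error values from the list above.

```python
def share_of_gap_closed(err_ols, err_all, err_miss):
    """Fraction of the OLS-to-benchmark accuracy gap recovered by the missing-data model."""
    gap = err_ols - err_all
    return (err_ols - err_miss) / gap

share = share_of_gap_closed(err_ols=0.30, err_all=0.15, err_miss=0.20)
print(f"{share:.1%}")
```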

[Scatter plot of property locations; legend: sold, unsold]

Figure 1: The location of unconnected sold and unsold properties

[Sparsity patterns of three weight-matrix blocks: WSS (nz = 1406), WSU (nz = 97), WUU (nz = 1406)]

Figure 2: Low spatial connectivity between sold and unsold properties

[Scatter plot of property locations; legend: sold, unsold]

Figure 3: The location of connected sold and unsold properties

[Sparsity patterns of three weight-matrix blocks: WSS (nz = 600), WSU (nz = 906), WUU (nz = 594)]

Figure 4: High spatial connectivity between sold and unsold properties

[Plot of the distribution of estimates against ρ values; series: SAR-EM, SAR-ALL]

Figure 5: Distribution of rho estimates, 1,000 trials

[Two panels showing the distribution of estimates for the β intercept parameters and the β slope parameters; series: SAR-EM, SAR-ALL]

Figure 6: Distribution of intercept and slope estimates, 1,000 trials

Similar results for actual real estate data

• Five time periods were examined, using 1993 sales to predict 1994, 1994 to predict 1995, and so on, with around 4,000-6,000 sales used to predict 4,000-6,000 sales the following year.

• The SAR-MISS predictions closed 61 to 74 percent of the gap

between OLS and SAR-ALL accuracy, averaging around 70

percent.

• Using 5,164 observations (one of every six) to predict the

remaining 25,823 we found: 58.1 percent of the gap between

OLS and SDM accuracy closed by the SDM-MISS model.

• Using 5,164 observations (one of every six) to predict

the remaining 25,823 we found: 68.5 percent of the gap

between OLS and SDM accuracy closed by the Bayesian

heteroscedastic/robust MCMC SDM model, compared to 58.1

percent for the non-robust model.

Conclusion

• The issue of using widely available information on the

characteristics and location of unsold properties has not

received much attention in the literature on hedonic price

models.

• These characteristics, in combination with knowledge of their location relative to sold properties, provide a large amount of covariance information that need not be ignored simply because the dependent variable is missing.

• Spatial estimators employing information on the unsold

properties have direct application in real estate assessment,

and such approaches have potential to materially change price

indices, and to ameliorate sample selectivity biases.

• Future work: sample selectivity bias associated with using only

sold homes.

• Future work: the impact of Bayesian prior information based

on the sold sample.

A spatial probit model

• A Bayesian probit model with individual effects that exhibit

spatial dependencies is set forth. Since probit models are often

used to model variation in individual choices, a model that

includes spatial interaction effects due to latent unobservables

associated with varying spatial location of the decision makers

seems useful.

Examples:

– Commuting choices made by individuals living in various

parts of a city.

– Public policy issues related to the Tiebout hypothesis such

as tax competition, welfare competition, benefit spillovers,

environmental/pollution spillovers.

– Voting choices.

• Individual choice is often modelled as dependent on:

– observed attributes of the choices

– observed characteristics of individuals.

Latent or unobserved attributes of the choices or characteristics

of individuals are frequently ignored. To the extent that these

latent factors are associated with the location of decision-

makers we can parsimoniously model these using a spatial

autoregression.

• The model proposed here allows for a parameter vector of spatial interaction effects that takes the form of a spatial autoregression. This model extends the class of Bayesian spatial logit/probit models presented in LeSage (2000) and relies on a hierarchical construct that we estimate via Markov Chain Monte Carlo methods.

Utility maximizing choices

For a binary 0, 1 choice, made by individuals k in region i

with alternatives labeled a = 0, 1:

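$$
\begin{aligned}
U_{ik0} &= \gamma'\omega_{ik0} + \alpha_0' s_{ik} + \theta_{i0} + \varepsilon_{ik0} \\
U_{ik1} &= \gamma'\omega_{ik1} + \alpha_1' s_{ik} + \theta_{i1} + \varepsilon_{ik1} && (17)
\end{aligned}
$$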

Where:

$\omega$ represent observed attributes of the $a = 0, 1$ alternative

$s$ represent observed attributes of individuals $k$

$\theta_{ia} + \varepsilon_{ika}$ represent unobserved properties of individuals $k$, regions $i$ or alternatives $a$.

We decompose the unobserved effects on utility into:

– a regional effect $\theta_{ia}$, assuming homogeneity across individuals $k$ in region $i$.

– an individualistic effect $\varepsilon_{ika}$

The individualistic effects $\varepsilon_{ika}$ are assumed conditionally independent given $\theta_{ia}$, so unobserved dependencies between individual utilities for alternative $a$ within region $i$ are captured by $\theta_{ia}$.

Spatial autoregressive unobserved interaction effects

Following Amemiya (1995, section 9.2) one can use

utility differences between alternatives along with the utility

maximization hypothesis to arrive at a probit regression

relationship.

$$
z_{ik} = U_{ik1} - U_{ik0} = x_{ik}'\beta + \theta_i + \varepsilon_{ik} \qquad (18)
$$

Our contribution is to model the unobserved dependencies between utility differences of individuals in separate regions (the regional effects $\theta_i$, $i = 1, \ldots, m$) as following a spatial autoregressive structure:

$$
\begin{aligned}
\theta_i &= \rho \sum_{j=1}^{m} w_{ij}\theta_j + u_i, \quad i = 1, \ldots, m \\
u &\sim N(0, \sigma^2 I_m) && (19)
\end{aligned}
$$

The intuition here is that unobserved utility-difference aspects that are common to individuals in a given region may be similar to those for individuals in neighboring or nearby regions.

Heteroscedastic individual effects

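• $\varepsilon_{ik}$ are treated as exchangeable and modeled as conditionally iid normal variates with zero means and common variance $v_i$, given $\theta_i$.

$$
\begin{aligned}
\varepsilon_i &= (\varepsilon_{ik} : k = 1, \ldots, n_i)' \\
\varepsilon_i \mid \theta_i &\sim N(0, v_i I_{n_i}) \\
\varepsilon \mid \theta &\sim N(0, V) \\
V &= \begin{pmatrix} v_1 I_{n_1} & & \\ & \ddots & \\ & & v_m I_{n_m} \end{pmatrix}
\end{aligned}
$$

• The model in vector form (here $1_i$ denotes an $n_i \times 1$ vector of ones):

$$
\begin{aligned}
z &= X\beta + \Delta\theta + \varepsilon \\
\Delta &= \begin{pmatrix} 1_1 & & \\ & \ddots & \\ & & 1_m \end{pmatrix} \\
\varepsilon_i \mid \theta_i &\sim N(0, v_i I_{n_i}) \\
V &= \mathrm{diag}(\Delta v), \quad v = (v_i : i = 1, \ldots, m) && (20)
\end{aligned}
$$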
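To make the vector form concrete, the following sketch simulates a small data set from (19) and (20). The number of regions, the group sizes, the ring-shaped region-level $W$ and all parameter values are assumptions chosen for illustration. The binary outcome is taken to be the indicator that the latent utility difference is positive, which is the usual latent-variable reading of a probit model.

```python
import numpy as np

rng = np.random.default_rng(5)
m = 6                                            # number of regions (toy size)
n_i = np.array([4, 3, 5, 2, 6, 4])               # individuals per region
n = n_i.sum()
rho, sigma2 = 0.5, 1.0
v = rng.uniform(0.5, 2.0, size=m)                # region-specific variances v_i
beta = np.array([0.5, -1.0])

# Toy region-level W: neighbours on a ring, rows sum to one.
W = np.zeros((m, m))
for i in range(m):
    W[i, (i - 1) % m] = W[i, (i + 1) % m] = 0.5

# Delta maps regions to individuals: column i holds ones for the n_i members of region i.
region = np.repeat(np.arange(m), n_i)
Delta = np.zeros((n, m))
Delta[np.arange(n), region] = 1.0

# Regional effects: theta = (I - rho W)^{-1} u with u ~ N(0, sigma2 I).
theta = np.linalg.solve(np.eye(m) - rho * W,
                        rng.normal(scale=np.sqrt(sigma2), size=m))
V_diag = Delta @ v                                # diagonal of V = diag(Delta v)
eps = rng.normal(size=n) * np.sqrt(V_diag)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
z = X @ beta + Delta @ theta + eps                # latent utility differences
y = (z > 0).astype(int)                           # observed binary choices
```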

Albert and Chib (1993) latent treatment of z

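$$
\begin{aligned}
\Pr(Y_{ik} = 1 \mid z_{ik}) &= \delta(z_{ik} > 0) && (21) \\
\Pr(Y_{ik} = 0 \mid z_{ik}) &= \delta(z_{ik} \le 0)
\end{aligned}
$$

• Where $\delta(A)$ is an indicator function, with $\delta(A) = 1$ for all outcomes in which $A$ occurs and $\delta(A) = 0$ otherwise.

• If the outcome value is $Y = (Y_{ik} \in \{0, 1\})$, then [following Albert and Chib (1993)] these relations may be combined as follows:

$$
\Pr(Y_{ik} = y_{ik} \mid z_{ik}) = \delta(y_{ik} = 1)\,\delta(z_{ik} > 0) + \delta(y_{ik} = 0)\,\delta(z_{ik} \le 0) \qquad (22)
$$

• This produces a conditional posterior for $z_{ik}$ that is a truncated normal distribution, which can be expressed as follows:

$$
z_{ik} \mid \star \sim
\begin{cases}
N(x_{ik}'\beta + \theta_i, v_i) \text{ left-truncated at } 0, & \text{if } y_{ik} = 1 \\
N(x_{ik}'\beta + \theta_i, v_i) \text{ right-truncated at } 0, & \text{if } y_{ik} = 0
\end{cases}
\qquad (23)
$$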
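A draw from the truncated normal conditional can be sketched with `scipy.stats.truncnorm`. The function name `draw_latent_z` and the input values are hypothetical; `mean` stands in for $x_{ik}'\beta + \theta_i$ and `var` for $v_i$. One detail worth noting is that SciPy expects the truncation bounds in standardized units.

```python
import numpy as np
from scipy.stats import truncnorm

def draw_latent_z(mean, var, y, rng):
    """Draw z from N(mean, var) truncated to (0, inf) where y == 1 and (-inf, 0] where y == 0."""
    sd = np.sqrt(var)
    # scipy's truncnorm takes bounds in standardized units: (bound - loc) / scale.
    zero_std = (0.0 - mean) / sd
    lower = np.where(y == 1, zero_std, -np.inf)
    upper = np.where(y == 1, np.inf, zero_std)
    return truncnorm.rvs(lower, upper, loc=mean, scale=sd,
                         size=mean.shape, random_state=rng)

rng = np.random.default_rng(6)
y = np.array([1, 0, 1, 1, 0, 0])
mean = np.array([0.3, 0.3, -1.2, 2.0, -0.5, 1.5])   # stands in for x_ik' beta + theta_i
var = np.array([1.0, 1.0, 0.5, 2.0, 1.5, 1.0])      # stands in for v_i
z = draw_latent_z(mean, var, y, rng)
```

In a sampler this draw would be repeated on every pass, with `mean` and `var` recomputed from the current values of $\beta$, $\theta$ and $v$.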

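Hierarchical Bayesian Priors

The following prior distributions are standard [see LeSage (1999)]:

$$
\begin{aligned}
\beta &\sim N(c, T) && (24) \\
r/v_i &\sim \text{ID}\ \chi^2(r) && (25) \\
1/\sigma^2 &\sim \Gamma(\alpha, \nu) && (26) \\
\rho &\sim U[(\lambda_{\min}^{-1}, \lambda_{\max}^{-1})] && (27)
\end{aligned}
$$

These induce the following conditional priors:

$$
\begin{aligned}
\pi(\theta \mid \rho, \sigma^2) &\propto (\sigma^2)^{-m/2}\,|B_\rho| \exp\!\left(-\frac{1}{2\sigma^2}\,\theta' B_\rho' B_\rho \theta\right) \\
B_\rho &= I_m - \rho W && (28) \\
\pi(\varepsilon \mid V) &\propto |V|^{-1/2} \exp\!\left(-\frac{1}{2}\,\varepsilon' V^{-1} \varepsilon\right) && (29) \\
\pi(z \mid \beta, \theta, V) &\propto |V|^{-1/2} \exp\!\left(-\frac{1}{2}\, e' V^{-1} e\right) && (30) \\
e &= z - X\beta - \Delta\theta
\end{aligned}
$$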

Estimating the model

• Estimation will be achieved via Markov Chain Monte Carlo

methods that sample sequentially from the complete set of

conditional distributions for the parameters. To implement

the MCMC sampling approach we need to derive the complete

conditional distributions for all parameters in the model. Given

these, we proceed to sample sequential draws from these

distributions for the parameter values. Gelfand and Smith

(1990) demonstrate that MCMC sampling from the sequence

of complete conditional distributions for all parameters in the

model produces a set of estimates that converge in probability

to the true (joint) posterior distribution of the parameters.

• The complete conditional distributions for all parameters in

the model are derived in the paper.

• A few comments on innovative aspects:

The conditional distribution of θ:

$$
\begin{aligned}
p(\theta \mid \beta, \rho, \sigma^2, V, z, y) &\sim N(A_0^{-1}b_0,\ A_0^{-1}) && (31) \\
A_0 &= \sigma^{-2} B_\rho' B_\rho + \Delta' V^{-1} \Delta \\
b_0 &= \Delta' V^{-1}(z - X\beta)
\end{aligned}
$$

• where the mean vector is $A_0^{-1}b_0$ and the covariance matrix is $A_0^{-1}$, which involves the inverse of the $m \times m$ matrix $A_0$, which depends on $\rho$. This implies that this matrix inverse must be computed on each MCMC draw during the estimation procedure. Typically a few thousand draws will be needed to produce a posterior estimate of the parameter distribution for $\theta$, suggesting that this approach to sampling from the conditional distribution of $\theta$ may be costly in terms of time if $m$ is large.

• In our illustration we rely on a sample of 3,110 US counties and the 48 contiguous states, so that $m = 48$. In this case, computing the inverse was relatively fast, allowing us to produce 15,000 draws in 285 seconds using a compiled C-language program on a laptop with a Pentium 1.6M processor.
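One draw from the multivariate normal in (31) can be sketched as follows. Note that this example swaps the explicit inverse described above for a Cholesky factorization of $A_0$ plus linear solves, which is a common alternative and not necessarily what the compiled program does. The function name `draw_theta` and the toy inputs (including the placeholder latent vector `z`) are assumptions for illustration.

```python
import numpy as np

def draw_theta(z, X, beta, Delta, v, W, rho, sigma2, rng):
    """One draw from N(A0^{-1} b0, A0^{-1}) using a Cholesky factor of A0."""
    m = W.shape[0]
    B = np.eye(m) - rho * W
    V_inv_diag = 1.0 / (Delta @ v)                        # V is diagonal
    A0 = B.T @ B / sigma2 + Delta.T @ (V_inv_diag[:, None] * Delta)
    b0 = Delta.T @ (V_inv_diag * (z - X @ beta))
    L = np.linalg.cholesky(A0)                            # A0 = L L'
    mean = np.linalg.solve(A0, b0)
    # If eta ~ N(0, I), then solving L' x = eta gives x ~ N(0, A0^{-1}).
    draw = mean + np.linalg.solve(L.T, rng.normal(size=m))
    return draw, mean, A0

rng = np.random.default_rng(7)
m = 5
n_i = np.array([3, 4, 2, 5, 3])
n = n_i.sum()
region = np.repeat(np.arange(m), n_i)
Delta = np.zeros((n, m))
Delta[np.arange(n), region] = 1.0
W = np.zeros((m, m))
for i in range(m):
    W[i, (i - 1) % m] = W[i, (i + 1) % m] = 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.5, -1.0])
v = np.full(m, 1.0)
z = rng.normal(size=n)                                    # placeholder latent values
theta, mean, A0 = draw_theta(z, X, beta, Delta, v, W, rho=0.4, sigma2=1.0, rng=rng)
```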

An alternative approach for large problems

• In the Appendix we provide an alternative approach that involves only univariate normal distributions for each element $\theta_i$ conditional on all other elements of $\theta$ excluding the $i$th element, $\theta_{-i}$. This approach is amenable to computation for much larger sizes for $m$.

Specifically:

– The univariate normal density for each $\theta_i$ given $\theta_{-i}$, $i = 1, \ldots, m$, takes the form:

$$
\begin{aligned}
\theta_i \mid (\theta_{-i}, \beta, \rho, \sigma^2, V, z, y) &\sim N\!\left(\frac{b_i}{a_i},\ \frac{1}{a_i}\right) && (32) \\
a_i &= \frac{1}{\sigma^2} + \frac{\rho^2}{\sigma^2}\, w_{.i}' w_{.i} + \frac{n_i}{v_i} \\
b_i &= \phi_i + \frac{\rho}{\sigma^2} \sum_{j \neq i} \theta_j (w_{ji} + w_{ij}) - \frac{\rho^2}{\sigma^2}\, w_{.i}' W_{-i} \theta_{-i}
\end{aligned}
$$

• This approach allows us to produce estimates for a problem using nearly 60,000 US census tracts with 3,000 counties as the regions. Time required to produce 10,000 draws is around 45 minutes using a compiled C-language program on a laptop with a Pentium 1.6M processor.
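The univariate conditionals in (32) can be checked against the joint precision matrix $A_0 = \sigma^{-2}B_\rho'B_\rho + \Delta'V^{-1}\Delta$ from (31). The sketch below makes two assumptions that are not stated on the slides: $\phi_i$ is taken to be the $i$th element of $b_0$, and $W_{-i}$ is read as $W$ with its $i$th column removed. The function name `theta_i_conditional` and the random inputs are illustrative only.

```python
import numpy as np

def theta_i_conditional(i, theta, phi, W, rho, sigma2, n_i, v):
    """Mean and variance of theta_i given theta_{-i}, following the univariate form."""
    w_col = W[:, i]                                      # w_{.i}, the ith column of W
    others = np.delete(np.arange(W.shape[0]), i)
    a_i = 1.0 / sigma2 + (rho**2 / sigma2) * (w_col @ w_col) + n_i[i] / v[i]
    cross = theta[others] @ (W[others, i] + W[i, others])
    b_i = (phi[i]
           + (rho / sigma2) * cross
           - (rho**2 / sigma2) * (w_col @ W[:, others] @ theta[others]))
    return b_i / a_i, 1.0 / a_i

rng = np.random.default_rng(8)
m, rho, sigma2 = 6, 0.4, 1.5
W = rng.uniform(size=(m, m))
np.fill_diagonal(W, 0.0)
W /= W.sum(axis=1, keepdims=True)                        # row-standardize
n_i = np.array([3, 5, 2, 4, 6, 3])
v = rng.uniform(0.5, 2.0, size=m)
theta = rng.normal(size=m)
phi = rng.normal(size=m)                                 # stands in for the elements of b0

# Full conditional implied by the joint precision matrix A0, for comparison.
B = np.eye(m) - rho * W
A0 = B.T @ B / sigma2 + np.diag(n_i / v)
i = 2
others = np.delete(np.arange(m), i)
mean_check = (phi[i] - A0[i, others] @ theta[others]) / A0[i, i]
var_check = 1.0 / A0[i, i]
mean_32, var_32 = theta_i_conditional(i, theta, phi, W, rho, sigma2, n_i, v)
```

Each update touches only one column and one row of $W$, so with a sparse $W$ the cost per element stays small even when $m$ is large.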

The conditional distribution of ρ:

$$
p(\rho \mid \star) \propto |B_\rho| \exp\!\left(-\frac{1}{2\sigma^2}\,\theta'(I_m - \rho W)'(I_m - \rho W)\theta\right) \qquad (33)
$$

• where $\rho \in (\lambda_{\min}^{-1}, \lambda_{\max}^{-1})$. As noted in LeSage (2000), this is not reducible to a standard distribution, so we might adopt an M-H step during the MCMC sampling procedures. LeSage (1999) suggests that a normal or $t$-distribution be used as a transition kernel in the M-H step.

• Another approach that we are experimenting with for this model is to rely on univariate numerical integration to obtain the conditional posterior density of $\rho$, and then produce a draw by inversion. This requires integration on every trip through the sampler, but is speedy using vectorization.
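The integrate-and-invert idea can be sketched on a grid. The following is a minimal illustration under stated assumptions: the function name `draw_rho`, the trapezoid rule, the dense `slogdet` evaluated in a plain loop (rather than precomputed, vectorized log-determinants), and the toy ring-shaped $W$ are all choices made for this example.

```python
import numpy as np

def draw_rho(theta, W, sigma2, rho_grid, rng):
    """Draw rho from the conditional kernel by numerical integration and CDF inversion."""
    m = W.shape[0]
    eye = np.eye(m)
    log_kernel = np.empty_like(rho_grid)
    for g, r in enumerate(rho_grid):
        B = eye - r * W
        resid = B @ theta
        log_kernel[g] = np.linalg.slogdet(B)[1] - resid @ resid / (2.0 * sigma2)
    dens = np.exp(log_kernel - log_kernel.max())          # rescale to avoid underflow
    # Trapezoid rule for the cumulative integral, then normalize to a CDF on [0, 1].
    steps = 0.5 * (dens[1:] + dens[:-1]) * np.diff(rho_grid)
    cdf = np.concatenate([[0.0], np.cumsum(steps)])
    cdf /= cdf[-1]
    return np.interp(rng.uniform(), cdf, rho_grid)        # inverse-CDF draw

rng = np.random.default_rng(9)
m = 30
W = np.zeros((m, m))
for i in range(m):
    W[i, (i - 1) % m] = W[i, (i + 1) % m] = 0.5
theta = np.linalg.solve(np.eye(m) - 0.5 * W, rng.normal(size=m))
# For this symmetric W the eigenvalues lie in [-1, 1], so the grid stays inside (-1, 1).
rho_grid = np.linspace(-0.99, 0.99, 397)
rho_draw = draw_rho(theta, W, sigma2=1.0, rho_grid=rho_grid, rng=rng)
```

Since the log-determinant term depends only on $W$ and the grid, it could be computed once before sampling begins, leaving only the quadratic form to be updated on each pass.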

Special cases of the model

• The homoscedastic case, where individual variances are assumed equal across all regions, so the regional variance vector $v$ reduces to a scalar.

• The individual spatial-dependency case, where individuals are treated as ‘regions’ denoted by the index $i$. In this case we are essentially setting $m = n$ and $n_i = 1$ for all $i = 1, \ldots, m$.

• Note that although one could in principle consider

heteroscedastic effects among individuals, the existence of

a single observation per individual renders estimation of such

variances problematic at best.
