The basic area-level model (Fay and Herriot 1979; Rao and Molina 2015) is given by \[ y_i \mid \theta_i \stackrel{\mathrm{ind}}{\sim} {\cal N} (\theta_i, \psi_i) \,, \qquad \theta_i = \beta' x_i + v_i \,, \] where \(i\) runs from 1 to \(m\), the number of areas, \(\beta\) is a vector of regression coefficients for given covariates \(x_i\), and \(v_i \stackrel{\mathrm{iid}}{\sim} {\cal N} (0, \sigma_v^2)\) are independent random area effects. Note that the \(y_i\) are independent but not identically distributed, since each area has its own given sampling variance \(\psi_i\).
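For fixed \(\beta\) and \(\sigma_v^2\), the posterior mean of \(\theta_i\) takes the familiar shrinkage form (a standard result for this model, stated here only to help interpret the estimates and weights discussed below): \[ \mathrm{E}(\theta_i \mid y_i, \beta, \sigma_v^2) = \gamma_i y_i + (1 - \gamma_i) \beta' x_i \,, \qquad \gamma_i = \frac{\sigma_v^2}{\sigma_v^2 + \psi_i} \,, \] so areas with larger sampling variance \(\psi_i\) are shrunk more strongly towards the regression prediction \(\beta' x_i\).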
First we generate some data according to this model:
m <- 75L # number of areas
df <- data.frame(
area=1:m, # area indicator
x=runif(m) # covariate
)
v <- rnorm(m, sd=0.5) # true area effects
theta <- 1 + 3*df$x + v # quantity of interest
psi <- runif(m, 0.5, 2) / sample(1:25, m, replace=TRUE) # given variances
df$y <- rnorm(m, theta, sqrt(psi))
A sampler function for a model with a regression component and a random intercept is created by
library(mcmcsae)
model <- y ~ reg(~ 1 + x, name="beta") + gen(factor = ~iid(area), name="v")
sampler <- create_sampler(
model,
family=f_gaussian(var.prior=pr_fixed(1), var.vec = ~ psi),
linpred="fitted", data=df
)
The meaning of the arguments used is as follows:
- family allows one to choose among a number of sampling distributions and to pass additional family-dependent arguments. In this case the scalar observation-level variance parameter is fixed at the value 1, and the unequal variances are set to the vector psi.
- linpred="fitted" indicates that we wish to obtain samples from the posterior distribution of the vector \(\theta\) of small area means.
- data is the data.frame in which the variables used in the model specification are looked up.

An MCMC simulation using this sampler function is then carried out as follows:
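sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)  # store.all=TRUE keeps draws for all model parameters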
A summary of the results is obtained by
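(summ <- summary(sim))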
## llh_ :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## llh_ -28 5.86 -4.77 0.119 -38.1 -27.7 -19 2438 1
##
## linpred_ :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## 1 3.60 0.309 11.66 0.00564 3.08 3.60 4.09 3000 0.999
## 2 1.41 0.170 8.26 0.00318 1.12 1.41 1.69 2866 1.000
## 3 2.96 0.392 7.54 0.00754 2.32 2.95 3.61 2703 1.001
## 4 2.10 0.199 10.54 0.00363 1.78 2.09 2.43 3000 1.000
## 5 1.53 0.313 4.88 0.00585 1.01 1.53 2.05 2859 1.000
## 6 3.03 0.242 12.52 0.00466 2.63 3.03 3.44 2708 1.002
## 7 2.50 0.319 7.83 0.00607 1.98 2.49 3.03 2758 1.001
## 8 3.90 0.327 11.94 0.00624 3.38 3.90 4.44 2739 1.000
## 9 1.60 0.344 4.66 0.00627 1.05 1.59 2.20 3000 1.000
## 10 3.09 0.363 8.52 0.00671 2.51 3.09 3.70 2932 1.000
## ... 65 elements suppressed ...
##
## beta :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## (Intercept) 0.924 0.139 6.65 0.00255 0.696 0.924 1.15 2967 1
## x 3.013 0.257 11.73 0.00469 2.590 3.015 3.43 3000 1
##
## v_sigma :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## v_sigma 0.5 0.0624 8 0.00139 0.404 0.498 0.61 2016 1
##
## v :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## 1 0.00498 0.317 0.0157 0.00578 -0.5273 0.00648 0.517 3000 0.999
## 2 0.47187 0.202 2.3338 0.00381 0.1440 0.46878 0.809 2821 1.001
## 3 0.10435 0.390 0.2675 0.00750 -0.5121 0.08952 0.751 2704 1.000
## 4 0.36444 0.210 1.7341 0.00384 0.0302 0.36103 0.715 3000 1.000
## 5 -0.07297 0.313 -0.2329 0.00582 -0.5925 -0.07292 0.454 2899 1.000
## 6 0.39592 0.251 1.5794 0.00479 -0.0137 0.39159 0.805 2742 1.002
## 7 0.38782 0.319 1.2148 0.00612 -0.1151 0.38418 0.922 2722 1.001
## 8 0.66963 0.331 2.0242 0.00638 0.1335 0.65897 1.223 2691 1.001
## 9 0.36139 0.345 1.0465 0.00631 -0.1918 0.34871 0.955 3000 0.999
## 10 -0.01508 0.361 -0.0418 0.00664 -0.5886 -0.01780 0.579 2947 1.000
## ... 65 elements suppressed ...
In this example we can compare the model parameter estimates to the ‘true’ parameter values that have been used to generate the data. In the next plots we compare the estimated and ‘true’ random effects, as well as the model estimates and ‘true’ estimands. In the latter plot, the original ‘direct’ estimates are added as red triangles.
plot(v, summ$v[, "Mean"], xlab="true v", ylab="posterior mean"); abline(0, 1)
plot(theta, summ$linpred_[, "Mean"], xlab="true theta", ylab="estimated"); abline(0, 1)
points(theta, df$y, col=2, pch=2)
We can compute model selection measures DIC and WAIC by
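compute_DIC(sim)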
## DIC p_DIC
## 105.23490 49.31653
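compute_WAIC(sim)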
## WAIC1 p_WAIC1 WAIC2 p_WAIC2
## 76.74258 20.82089 99.23977 32.06948
Posterior means of residuals can be extracted from the simulation output using method residuals. Here is a plot of (posterior means of) residuals against covariate \(x\):
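plot(df$x, residuals(sim, mean.only=TRUE), xlab="x", ylab="residual")  # mean.only=TRUE returns posterior means
abline(h=0)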
A linear predictor in a linear model can be expressed as a weighted sum of the response variable. If we set compute.weights=TRUE, then such weights are computed for all linear predictors specified in argument linpred. In this case it means that a set of weights is computed for each area.
sampler <- create_sampler(
model,
family=f_gaussian(var.prior=pr_fixed(1), var.vec = ~ psi),
linpred="fitted", data=df, compute.weights=TRUE
)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
Now the weights method returns a matrix of weights, in this case a 75 \(\times\) 75 matrix with elements \(w_{ij}\) holding the weight of direct estimate \(i\) in linear predictor \(j\). To verify that the weights applied to the direct estimates yield the model-based estimates, we plot them against each other. Also shown is a plot of the weight of the direct estimate for each area in the predictor for that same area, against the variance of that direct estimate.
plot(summ$linpred_[, "Mean"], crossprod(weights(sim), df$y),
xlab="estimate", ylab="weighted average")
abline(0, 1)
plot(psi, diag(weights(sim)), ylab="weight")
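As a quick numerical check (a sketch, assuming the computed weights reproduce the model-based estimates up to Monte Carlo accuracy, as the first plot suggests), the weighted averages can be compared to the posterior means directly:

max(abs(crossprod(weights(sim), df$y) - summ$linpred_[, "Mean"]))  # should be close to zero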