BDA3 Chapter 5 Exercise 13
Here’s my solution to exercise 13, chapter 5, of Gelman’s Bayesian Data Analysis (BDA), 3rd edition. There are solutions to some of the exercises on the book’s webpage.
df <- read_csv('data/chapter_03_exercise_08.csv') %>%
  filter(type == 'residential' & bike_route) %>%
  transmute(
    i = 1:n(),
    bikes,
    other,
    total = bikes + other,
    rate = bikes / total
  )
i | bikes | other | total | rate |
---|---|---|---|---|
1 | 16 | 58 | 74 | 0.2162162 |
2 | 9 | 90 | 99 | 0.0909091 |
3 | 10 | 48 | 58 | 0.1724138 |
4 | 13 | 57 | 70 | 0.1857143 |
5 | 19 | 103 | 122 | 0.1557377 |
6 | 20 | 57 | 77 | 0.2597403 |
7 | 18 | 86 | 104 | 0.1730769 |
8 | 17 | 112 | 129 | 0.1317829 |
9 | 35 | 273 | 308 | 0.1136364 |
10 | 55 | 64 | 119 | 0.4621849 |
We’ll use the prior on \(\alpha\), \(\beta\) given in equation 5.9 and implement it in Stan. Note that Stan works on the log scale, so we increment the log posterior density (the `target` variable) by \(\log\left( (\alpha + \beta)^{-\frac{5}{2}} \right) = -\frac{5}{2}\log(\alpha + \beta)\). Here is the model code:
model <- rstan::stan_model('src/ex_05_13.stan')
S4 class stanmodel 'ex_05_13' coded as follows:
data {
  int<lower = 1> n;
  int<lower = 0> total[n];
  int<lower = 0> bikes[n];
}
parameters {
  real<lower = 0> alpha;
  real<lower = 0> beta;
  vector<lower = 0, upper = 1>[n] theta;
}
model {
  // joint prior on alpha, beta
  target += -(5. / 2.) * log(alpha + beta);
  // theta prior
  theta ~ beta(alpha, beta);
  // likelihood
  bikes ~ binomial(total, theta);
}
Now we calculate the posterior, using tidybayes to take care of passing the data to Stan.
fit <- model %>%
  rstan::sampling(data = tidybayes::compose_data(df)) %>%
  tidybayes::recover_types(df)
We don’t show any diagnostics here, but there are no divergences, the \(\hat{R}\) values are very close to 1, the effective sample sizes are large, and the traces look reasonable.
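For reference, a minimal sketch of these checks using rstan’s built-in helpers:

rstan::check_hmc_diagnostics(fit)                  # divergences, tree depth, E-BFMI
rstan::summary(fit)$summary[, c('Rhat', 'n_eff')]  # convergence and effective sample size
rstan::traceplot(fit, pars = c('alpha', 'beta'))   # visual check of the chains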
Now we can draw some samples from the posterior.
draws <- fit %>%
  tidybayes::spread_draws(alpha, beta, theta[i])
We can also use our posterior draws to plot the observed (i.e. unpooled) rates against the estimated rates. In this case, there are no large differences, although the smaller rates are estimated slightly higher than observed and the larger rates slightly lower, as expected from partial pooling. All observed rates are within the posterior intervals of the estimates.
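The plot itself isn’t reproduced here, but a sketch of how it could be drawn with ggplot2, using the draws and df objects from above:

library(ggplot2)

# summarise each theta[i] with its posterior median and 95% interval,
# then compare against the observed (unpooled) rates
draws %>%
  tidybayes::median_qi(theta) %>%
  inner_join(df, by = 'i') %>%
  ggplot(aes(x = rate, y = theta)) +
  geom_abline(linetype = 'dashed') +  # the y = x line of no shrinkage
  geom_pointrange(aes(ymin = .lower, ymax = .upper)) +
  labs(x = 'observed rate', y = 'estimated rate')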
To calculate a posterior interval for the average underlying proportion of bike traffic, we sample \(\alpha, \beta\) from the posterior, then draw a new \(\tilde\theta \sim \operatorname{Beta}(\alpha, \beta)\). It wouldn’t be correct to use the values of \(\theta_i\) from the model parameters, since those are estimates for the known streets. If we also want an estimate of the number of bikes observed on a new street (where 100 total vehicles go by), we draw \(\tilde y \sim \operatorname{Binomial}(100, \tilde\theta)\). The posterior intervals are then just the desired quantiles of the drawn parameters.
cis <- draws %>%
  filter(i == 1) %>%  # keep one row per posterior draw of (alpha, beta)
  mutate(
    theta = rbeta(n(), alpha, beta),
    y = rbinom(n(), 100, theta)
  ) %>%
  tidybayes::median_qi() %>%
  select(matches('y|theta'))
The 95% posterior interval for \(\tilde\theta\) is (3.65%, 49.1%), which includes all 10 of the observed rates. The 95% posterior interval for \(\tilde y\) is (3, 49.025), which includes 9 of the 10 observed counts. These intervals seem reasonable, although they are fairly wide; it’s hard to say how useful they would be without a concrete application in mind.
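For completeness, a sketch of how those coverage counts can be computed, assuming median_qi named the interval columns theta.lower/theta.upper and y.lower/y.upper (as it does when summarising several variables at once):

# number of observed rates inside the interval for theta-tilde
sum(df$rate >= cis$theta.lower & df$rate <= cis$theta.upper)
# number of observed bike counts inside the interval for y-tilde
sum(df$bikes >= cis$y.lower & df$bikes <= cis$y.upper)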