CIS Primer Question 3.2.1
Here are my solutions to question 3.2.1 of Causal Inference in Statistics: a Primer (CISP).
Here are the parameters we’ll use. Note that they are taken from the Simpson’s revesal example of question 1.5.2.
r <- 0.28 # fraction with syndrome
q0 <- 0.07 # P(X = 1 | Z = 0)
q1 <- 0.85 # P(X = 1 | Z = 1)
p00 <- 0.84 # P(Y = 1 | X = 0, Z = 0)
p10 <- 0.88 # P(Y = 1 | X = 1, Z = 0)
p01 <- 0.53 # P(Y = 1 | X = 0, Z = 1)
p11 <- 0.58 # P(Y = 1 | X = 1, Z = 1)
Part a
We can simulate the intervention by generating values for \(X\) independently of \(Z\).
N <- 10000 # number of individuals
set.seed(53201)
part_a <- tibble(z = rbinom(N, 1, r)) %>%
mutate(
x = rbinom(n(), 1, 0.5), # no Z-dependence
p_y_given_x_z = case_when(
x == 0 & z == 0 ~ p00,
x == 0 & z == 1 ~ p01,
x == 1 & z == 0 ~ p10,
x == 1 & z == 1 ~ p11
),
y = rbinom(n(), 1, p_y_given_x_z)
) %>%
group_by(x, y) %>%
summarise(n = n()) %>%
group_by(x) %>%
mutate(p_y_given_do_x = n / sum(n))
x | y | n | p_y_given_do_x |
---|---|---|---|
0 | 0 | 1238 | 0.2470565 |
0 | 1 | 3773 | 0.7529435 |
1 | 0 | 964 | 0.1932251 |
1 | 1 | 4025 | 0.8067749 |
Part b
To simulate observational data, we need to include the dependence of \(X\) on \(Z\).
N <- 100000 # number of individuals
set.seed(95400)
p_x_y_z <- tibble(
id = 1:N,
z = rbinom(N, 1, r),
x = rbinom(N, 1, if_else(z == 0, q0, q1)),
p_y_given_x_z = case_when(
x == 0 & z == 0 ~ p00,
x == 0 & z == 1 ~ p01,
x == 1 & z == 0 ~ p10,
x == 1 & z == 1 ~ p11
),
y = rbinom(N, 1, p_y_given_x_z)
) %>%
group_by(x, y, z) %>%
count() %>%
ungroup() %>%
mutate(p = n / sum(n))
x | y | z | n | p |
---|---|---|---|---|
0 | 0 | 0 | 10723 | 0.10723 |
0 | 0 | 1 | 2020 | 0.02020 |
0 | 1 | 0 | 56015 | 0.56015 |
0 | 1 | 1 | 2205 | 0.02205 |
1 | 0 | 0 | 624 | 0.00624 |
1 | 0 | 1 | 10067 | 0.10067 |
1 | 1 | 0 | 4470 | 0.04470 |
1 | 1 | 1 | 13876 | 0.13876 |
In order to apply the causal effect rule, we’ll need \(\mathbb P(x \mid z)\).
p_x_given_z <- p_x_y_z %>%
group_by(x, z) %>%
summarise(n = sum(n)) %>%
group_by(z) %>%
mutate(p = n / sum(n)) %>%
ungroup()
x | z | n | p |
---|---|---|---|
0 | 0 | 66738 | 0.9290845 |
0 | 1 | 4225 | 0.1499929 |
1 | 0 | 5094 | 0.0709155 |
1 | 1 | 23943 | 0.8500071 |
We can then add the conditional probabilities to the joint distribution table, then sum overal all the \(Z\) variables.
p_y_given_do_x <- p_x_y_z %>%
inner_join(
p_x_given_z,
by = c('x', 'z'),
suffix = c('_num', '_denom')
) %>%
mutate(p = p_num / p_denom) %>%
group_by(x, y) %>%
summarise(p = sum(p))
x | y | p |
---|---|---|
0 | 0 | 0.2500877 |
0 | 1 | 0.7499123 |
1 | 0 | 0.2064264 |
1 | 1 | 0.7935736 |
The causal effect estimates are very close to the simulated intervention.
Part c
We can calculate ACE simply by taking the difference of the causal effect estimates.
ace <- p_y_given_do_x %>%
spread(x, p) %>%
filter(y == 1) %>%
mutate(ace = `1` - `0`) %>%
pull(ace)
ace
[1] 0.04366134
This is different from the overall probability differences.
p_y_given_x <- p_x_y_z %>%
group_by(x, y) %>%
summarise(n = sum(n)) %>%
group_by(x) %>%
mutate(p = n / sum(n)) %>%
select(-n)
risk_difference <- p_y_given_x %>%
spread(x, p) %>%
filter(y == 1) %>%
mutate(rd = `1` - `0`) %>%
pull(rd)
risk_difference
[1] -0.188613
Making \(X\) independent of \(Z\) would minimise the disrepancy between ACE and RD, which would turn the adjustment formula into the formulat for \(\mathbb P(y \mid x\). In other words, setting \(q_0 = q_1 = \mathbb P(X = 1)\) would do the trick.
Part d
Note that the desegregated causal effects
- \(p_{1, 0} - p_{0, 0}\) is 0.04; and
- \(p_{1, 1} - p_{0, 1}\) is 0.05,
are both consisent with our calculation for the overall causal effect, ACE = 4.37%. The generated data are an illustration of Simpson’s reversal because the risk difference, -18.9%, has the opposite sign.