Selecting on a Collider

Author

Andy Grogan-Kaylor

Published

October 31, 2024

1 Introduction

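A collider is a variable that is caused by two or more other variables. Selecting observations on a collider may bias the estimated relationship between the variables that cause it.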

2 Call Libraries

library(ggplot2) # graphics

library(patchwork) # graphics

library(dplyr) # data wrangling

library(sjPlot) # nice tables

set.seed(1234) # set random seed

3 Simulated Data

N <- 1000 # sample size

x <- rnorm(N, 100, 10) # randomly distributed x

e <- rnorm(N, 0, 5) # normal error

y <- 1 * x + e # y is a function of x + random error (e)

df <- data.frame(x, y) # data frame

head(df) # preview the first rows
          x         y
1  87.92934  81.90268
2 102.77429 104.28163
3 110.84441 103.14869
4  76.54302  79.71988
5 104.29125 107.80601
6 105.06056  95.53114

4 Select Data

We generate a collider: \(z = x + y\) and then select observations if \(z > 215\) (an admittedly arbitrary, but illustrative value).

df$z <- df$x + df$y # z is a collider

df$selected <- df$z > 215

df_selected <- df %>% # data selected on collider
  filter(selected)  # selection criterion
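As a quick check on the selection step, the sketch below rebuilds the simulated data with base R only and reports how many observations pass the cutoff, along with the correlation between x and y before and after selection. The names n_selected, prop_selected, cor_full, and cor_selected are illustrative and do not appear in the original code; the seed and data-generating steps are copied from the sections above.

```r
# Self-contained sketch (base R only): rebuild the simulated data
set.seed(1234)
N <- 1000
x <- rnorm(N, 100, 10)   # randomly distributed x
e <- rnorm(N, 0, 5)      # normal error
y <- 1 * x + e           # y is a function of x + random error (e)
df <- data.frame(x, y)

df$z <- df$x + df$y        # z is a collider
df$selected <- df$z > 215  # same illustrative cutoff as in the text

# Illustrative summaries (names are assumptions, not from the original)
n_selected    <- sum(df$selected)   # number of selected observations
prop_selected <- mean(df$selected)  # share of the sample that is selected
cor_full      <- cor(df$x, df$y)    # correlation in the entire sample
cor_selected  <- cor(df$x[df$selected], df$y[df$selected])  # after selection

print(c(n_selected = n_selected, prop_selected = prop_selected))
print(round(c(cor_full = cor_full, cor_selected = cor_selected), 3))
```

If the selection behaves as described, the correlation in the selected subset should be noticeably weaker than in the entire sample.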

5 Graphs

We graph the full dataset, distinguishing selected from non-selected observations, alongside a graph of the selected data only.

Relationships Differ in the Overall and Selected Data

We note that the relationship between x and y, as typified by the regression line, is quite different in the overall data and in the selected data.
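One way to see why the slope flattens is to write out the conditional mean of y among selected observations. The derivation below is a sketch under the simulation's own assumptions (y = x + e, with e normal, mean zero, and independent of x); the symbols \(c\) for the cutoff (215 here), \(\sigma\) for the error standard deviation (5 here), and \(\lambda\) for the inverse Mills ratio are notation introduced for this illustration.

```latex
\begin{align*}
z > c
  &\iff x + y > c
   \iff 2x + e > c
   \iff e > c - 2x \\[4pt]
E[\,y \mid x,\; z > c\,]
  &= x + E[\,e \mid e > c - 2x\,]
   = x + \sigma\,\lambda(\alpha),
  \qquad \alpha = \frac{c - 2x}{\sigma},
  \qquad \lambda(\alpha) = \frac{\phi(\alpha)}{1 - \Phi(\alpha)} \\[4pt]
\frac{\partial}{\partial x}\, E[\,y \mid x,\; z > c\,]
  &= 1 - 2\,\lambda'(\alpha),
  \qquad \lambda'(\alpha) = \lambda(\alpha)\bigl(\lambda(\alpha) - \alpha\bigr) \in (0, 1)
\end{align*}
```

Because \(\lambda'(\alpha)\) lies strictly between 0 and 1, the local slope among selected observations is always below the true slope of 1. The fitted regression line in the selected data is a linear approximation to this curved conditional mean.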

p1 <- ggplot(df,
             aes(x = x,
                 y = y)) + 
  geom_point(aes(color = selected)) +
  geom_smooth(method = "lm") +
  labs(title = "y ~ x",
       subtitle = "Entire Sample") +
  scale_color_manual(values = c('TRUE' = 'blue', 
                                'FALSE' = 'red')) +
  ylim(50, 150) + # specify y scale
  theme_minimal()
p2 <- ggplot(df_selected,
             aes(x = x,
                 y = y)) + 
  geom_point(aes(color = selected)) +
  geom_smooth(method = "lm") +
  labs(title = "y ~ x",
       subtitle = "Selected On Collider") +
  scale_color_manual(values = c('TRUE' = 'blue', 
                                'FALSE' = 'red')) +
  ylim(50, 150) + # specify y scale
  theme_minimal()
p1 + p2

6 Regressions

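Similarly, regressions find quite different coefficients for x in the two datasets: the estimated slope is 1.03 in the entire sample, close to the true value of 1, but only 0.79 in the sample selected on the collider.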

fit1 <- lm(y ~ x, data = df)

fit2 <- lm(y ~ x, data = df_selected)

tab_model(fit1, fit2,
          dv.labels = c("Entire Sample", "Selected on Collider"))
                    Entire Sample                        Selected on Collider
Predictors          Estimates   CI              p        Estimates   CI               p
(Intercept)         -2.71       -5.76 – 0.35    0.083    25.68       13.09 – 38.27    <0.001
x                   1.03        1.00 – 1.06     <0.001   0.79        0.68 – 0.90      <0.001
Observations        1000                                 228
R² / R² adjusted    0.814 / 0.814                        0.462 / 0.460
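Since the cutoff of 215 is arbitrary, it may help to see how the estimated slope changes as the cutoff moves. The sketch below uses base R only and rebuilds the data from the same seed. The helper slope_after_selection and the cutoff values other than 215 are hypothetical, introduced purely for illustration.

```r
# Self-contained sketch (base R only): rebuild the simulated data
set.seed(1234)
N <- 1000
x <- rnorm(N, 100, 10)
e <- rnorm(N, 0, 5)
y <- 1 * x + e
df <- data.frame(x, y)

# Hypothetical helper: slope of y ~ x among rows where x + y exceeds a cutoff
slope_after_selection <- function(cutoff, data) {
  kept <- data[data$x + data$y > cutoff, ]
  unname(coef(lm(y ~ x, data = kept))["x"])
}

# Illustrative cutoffs; only 215 comes from the text above
cutoffs <- c(150, 200, 215, 230)
slopes <- sapply(cutoffs, slope_after_selection, data = df)
names(slopes) <- cutoffs

# Slope in the entire sample, for comparison
slope_full <- unname(coef(lm(y ~ x, data = df))["x"])

print(round(c(full = slope_full, slopes), 2))
```

A very low cutoff keeps nearly every observation, so its slope should sit close to the full-sample estimate, while stricter cutoffs discard more of the data and would be expected to distort the slope more.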