Untitled

Introduction

When I first learned regression, we would look at a model like this:

\[Income_i = \beta_1 \times Education_i + \epsilon_i\]

We’d then say that for every additional year of Education, Income would be expected to increase by \(\beta_1\). We’d then add a control for Age:

\[Income_i = \beta_1 \times Education_i + \beta_2 \times Age_i + \epsilon_i\]

And then something mysterious would happen. We would now say that for every additional year of Education, Income would be expected to increase by \(\beta_1\), controlling for Age. What did this mean, “controlling for age”? Other, no less mysterious versions of this phrase include “adjusting for age”, or “holding age constant”.

Some textbooks discuss controlling for age in terms of comparisons. Here’s Gelman, Hill, and Vehtari in Regression and Other Stories:

We interpret the regression slopes as comparisons of individuals that differ in one predictor while being at the same levels of the other predictors.

Or from The Effect:

If two observations have the same values of the other variables in the model, but one has a value of \(X\) that is one unit higher, the observation with the \(X\) one unit higher will on average have a \(Y\) that is \(\beta_1\) units higher.

This was much more intuitive to me. In our case, we’d say that, comparing two people of identical age, we’d expect the one with an additional year of education to have, on average, \(\beta_1\) more in income.
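To see where that comparison comes from, here is a short sketch. It assumes the errors average out to zero at every combination of Education and Age, that is \(E[\epsilon_i \mid Education_i, Age_i] = 0\), which is my assumption rather than something stated above. The symbols \(e\) and \(a\) are just placeholders for a particular number of years of education and a particular age.

\[E[Income \mid Education = e + 1, Age = a] - E[Income \mid Education = e, Age = a] = \big(\beta_1 (e + 1) + \beta_2 a\big) - \big(\beta_1 e + \beta_2 a\big) = \beta_1\]

The \(\beta_2 a\) terms cancel because both people have the same age, which is all that “holding age constant” means inside this equation.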

But how is regression doing this? What about the regression makes this comparison possible?

Here are two ways of thinking about this that clarified for me what regression controls do, and what they can and can’t tell us.

library(tidyverse)

Let’s look at some survey data: gss_sm, a sample of the General Social Survey (GSS) from the socviz package:

library(socviz)
gss_sm |> 
  select(id, year, obama, childs, age)
# A tibble: 2,867 × 5
      id  year obama childs   age
   <dbl> <dbl> <dbl>  <dbl> <dbl>
 1     1  2016     0      3    47
 2     2  2016     1      0    61
 3     3  2016     0      2    72
 4     4  2016     0      4    43
 5     5  2016     1      2    55
 6     6  2016     1      2    53
 7     7  2016    NA      2    50
 8     8  2016    NA      3    23
 9     9  2016    NA      3    45
10    10  2016     0      4    71
# ℹ 2,857 more rows

“What’s left over”

mod1 = lm(childs ~ age, data = gss_sm)
modelsummary::modelsummary(mod1)
             (1)
(Intercept)  0.153
             (0.086)
age          0.035
             (0.002)
Num.Obs.     2849
R2           0.135
R2 Adj.      0.134
AIC          10593.5
BIC          10611.4
Log.Lik.     −5293.751
F            442.679
RMSE         1.55

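Let’s look at the residuals, the part of each respondent’s number of children that the age-based prediction leaves over (.resid is childs minus .fitted):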

augmented_data = broom::augment(mod1) 

augmented_data |> 
  select(.rownames, childs, age, .fitted, .resid) |> 
  sample_n(5) |> 
  kableExtra::kable(digits = 2)
.rownames  childs  age  .fitted  .resid
358             0   60     2.23   -2.23
1100            2   56     2.09   -0.09
2724            1   57     2.12   -1.12
1739            1   69     2.54   -1.54
2330            3   79     2.88    0.12
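
The .fitted and .resid columns can be rebuilt by hand, which makes “what’s left over” a bit more concrete. This is only a sketch: the names used, fitted_by_hand, and resid_by_hand are mine, and I’m assuming childs and age are plain numeric columns, as the tibble printout above suggests.

```r
library(socviz)  # provides gss_sm, as above

mod1 = lm(childs ~ age, data = gss_sm)

# Rows lm() actually used (it drops rows with a missing childs or age)
used = model.frame(mod1)

# Prediction from age alone: intercept + slope * age
b = coef(mod1)
fitted_by_hand = b[["(Intercept)"]] + b[["age"]] * used$age

# Residual: observed number of children minus that prediction
resid_by_hand = used$childs - fitted_by_hand

# Both should match what lm() stores, up to floating-point error
all.equal(fitted_by_hand, unname(fitted(mod1)))
all.equal(resid_by_hand, unname(residuals(mod1)))

# Because the model includes an intercept, the residuals should be
# (numerically) uncorrelated with age
cor(resid_by_hand, used$age)
```

The last line is the point of the exercise: whatever is left in the residuals is the part of childs that a straight line in age cannot account for, so its correlation with age should be essentially zero.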

Now the same kind of model, with obama as the outcome:

mod2 = lm(obama ~ age, data = gss_sm)
mod2

Call:
lm(formula = obama ~ age, data = gss_sm)

Coefficients:
(Intercept)          age  
    0.80149     -0.00328  

broom::augment(mod2)
# A tibble: 1,728 × 9
   .rownames obama   age .fitted .resid     .hat .sigma  .cooksd .std.resid
   <chr>     <dbl> <dbl>   <dbl>  <dbl>    <dbl>  <dbl>    <dbl>      <dbl>
 1 1             0    47   0.647 -0.647 0.000676  0.481 0.000612     -1.35 
 2 2             1    61   0.601  0.399 0.000687  0.481 0.000236      0.828
 3 3             0    72   0.565 -0.565 0.00127   0.481 0.000878     -1.18 
 4 4             0    43   0.660 -0.660 0.000823  0.481 0.000776     -1.37 
 5 5             1    55   0.621  0.379 0.000582  0.481 0.000180      0.787
 6 6             1    53   0.628  0.372 0.000580  0.481 0.000174      0.774
 7 10            0    71   0.569 -0.569 0.00120   0.481 0.000837     -1.18 
 8 13            1    32   0.697  0.303 0.00157   0.481 0.000314      0.631
 9 14            1    60   0.605  0.395 0.000659  0.481 0.000222      0.822
10 15            0    76   0.552 -0.552 0.00161   0.481 0.00106      -1.15 
# ℹ 1,718 more rows