And then something mysterious would happen. We would now say that for every additional year of Education, Income would be expected to increase by \(\beta_1\), controlling for Age. What did this mean, “controlling for age”? Other, no less mysterious versions of this phrase include “adjusting for age”, or “holding age constant”.
Some textbooks discuss controlling for age in terms of comparisons. Here’s Gelman, Hill, and Vehtari in Regression and Other Stories:
We interpret the regression slopes as comparisons of individuals that differ in one predictor while being at the same levels of the other predictors.
Or from The Effect:
If two observations have the same values of the other variables in the model, but one has a value of \(X\) that is one unit higher, the observation with the \(X\) one unit higher will on average have a \(Y\) that is \(\beta_1\) units higher.
This was much more intuitive to me. In our case, we’d say that if we compare two people of the same age, we’d expect the one with an additional year of education to earn \(\beta_1\) more, on average.
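To make that comparison concrete, here is a minimal sketch on simulated data. The variable names and every number in it are made up for illustration and are not taken from the GSS. It fits a model with education and age, then predicts income for two hypothetical people of the same age who differ by one year of education.

```r
# Simulated data: all coefficients and ranges below are illustrative assumptions.
set.seed(42)
n <- 500
age <- round(runif(n, 25, 65))
education <- round(8 + 0.05 * age + rnorm(n, sd = 2))
income <- 5000 + 2000 * education + 300 * age + rnorm(n, sd = 5000)
sim <- data.frame(income, education, age)

# Income regressed on education, controlling for age.
fit <- lm(income ~ education + age, data = sim)
beta_1 <- coef(fit)[["education"]]

# Two hypothetical people of the same age, one year of education apart.
pair <- data.frame(education = c(12, 13), age = c(40, 40))
predicted <- predict(fit, newdata = pair)

# The gap between their predicted incomes is the education coefficient.
gap <- unname(diff(predicted))
isTRUE(all.equal(gap, beta_1))
```

Because the model is linear with no interaction, the gap should be the same whichever age the two people share. This only shows what the fitted model implies. It does not yet explain how the regression arrives at that coefficient.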
But how is regression doing this? What about the regression makes this comparison possible?
Here are two ways of thinking about this that clarified for me what regression controls do, and what they can and can’t tell us.
library(tidyverse)
Let’s look at some survey data, a sample of the GSS from socviz: