Instrumental Variables and Endogeneity Part 2: Application in R

Ari Fenn, PhD, Researcher
September 22, 2021

Saws and other tools (photo by Sneaky Elbow on Unsplash)

In my last blog post, I introduced the concept of instrumental variables (IV) to address endogenous explanatory variables.

In this blog post, I will demonstrate how to implement an IV regression in R. I return to the first example from my last blog post, Angrist and Krueger (1991). In this scenario, education is endogenous when estimating the returns to education and is instrumented by the birth quarter. The data is available with this link. The data set is ‘NEW7080.rar’; extract it to your working directory to get the data file ‘NEW7080.dta’. This is a Stata data file, but R can easily load Stata data files with the “haven” package.

wages <- haven::read_dta('NEW7080.dta')

This file contains the raw data: the variables need to be renamed, and some need to be created. Variable names are found with this link. Variable creation is done with the “.do” files on the same page as the data; the R equivalent is shown below for those who do not have access to Stata. For those with access to Stata, feel free to skip to the next section. For workflow reasons, I prefer to use the “tidyverse” for data cleaning and processing. This demonstration will replicate the results of Table IV columns 1 and 4 from Angrist and Krueger (1991), which cover the oldest cohort, those born in the 1920s.

library(tidyverse)

# Rename variables, from QOB Table IV.do file and Descriptive Statistics QOB.txt
wages <- wages %>%
  rename(AGE = v1,
         AGEQ = v2,
         EDUC = v4,
         ENOCENT = v5,
         ESOCENT = v6,
         LWKLYWGE = v9,
         MARRIED = v10,
         MIDATL = v11,
         MT = v12,
         NEWENG = v13,
         CENSUS = v16,
         QOB = v18,
         RACE = v19,
         SMSA = v20,
         SOATL = v21,
         WNOCENT = v24,
         WSOCENT = v25,
         YOB = v27)

      

# Variable creation, from QOB Table IV.do
wages <- wages %>%
  mutate(
    # Side note, this is much more efficient than Stata
    COHORT = case_when(YOB %in% 30:39 ~ 30.39,
                       YOB %in% 40:49 ~ 40.49,
                       TRUE ~ 20.29),
    AGEQ = if_else(CENSUS == 80, AGEQ - 1900, AGEQ),
    AGEQSQ = AGEQ * AGEQ
  ) %>%
  # Since I am only using the cohort of those born in the 1920s, I filter here
  # and use R's better data processing
  filter(COHORT == 20.29)

# This creates a year-quarter column, which factor() can turn into dummy variables for the regression
wages <- wages %>%
  mutate(YOBQ = paste(YOB, QOB, sep = 'Q'))
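
To see what factor() does with a year-quarter label inside a regression formula, here is a minimal sketch on a made-up four-row data frame. The name `toy` and its values are illustrative only and are not part of the Angrist and Krueger data; model.matrix() is the base R function that builds the design matrix lm() works with.

```r
# Toy data: hypothetical values, only to illustrate the dummy expansion
toy <- data.frame(YOB = c(1920, 1920, 1921, 1921),
                  QOB = c(1, 2, 1, 2))
toy$YOBQ <- paste(toy$YOB, toy$QOB, sep = "Q")

# model.matrix() expands the factor into one 0/1 column per level;
# the first level ("1920Q1") is absorbed into the intercept
dummies <- model.matrix(~ factor(YOBQ), data = toy)
print(colnames(dummies))
```

Each remaining year-quarter cell gets its own indicator column, which is how the instruments enter the regressions further down.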

Now that the data are in the format needed for analysis, I will recreate the first column of Table IV: the OLS estimate of the returns to education, controlling only for age, age squared, and year of birth.

# This is the returns to education estimated with an OLS regression
ols_returns <- lm(LWKLYWGE ~ EDUC + AGEQ + AGEQSQ + factor(YOB),
                  data = wages)

print(summary(ols_returns))

## 
## Call:
## lm(formula = LWKLYWGE ~ EDUC + AGEQ + AGEQSQ + factor(YOB), data = wages)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6997 -0.2193  0.0555  0.3075  4.3934 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|) 
## (Intercept)      0.8582916  1.5210342   0.564   0.5726 
## EDUC             0.0801676  0.0003553 225.645   <2e-16 ***
## AGEQ             0.1445517  0.0675997   2.138   0.0325 * 
## AGEQSQ          -0.0015423  0.0007478  -2.062   0.0392 * 
## factor(YOB)1921 -0.0015849  0.0092156  -0.172   0.8635 
## factor(YOB)1922 -0.0112386  0.0147238  -0.763   0.4453 
## factor(YOB)1923 -0.0097366  0.0195457  -0.498   0.6184 
## factor(YOB)1924 -0.0065891  0.0235583  -0.280   0.7797 
## factor(YOB)1925  0.0031614  0.0268668   0.118   0.9063 
## factor(YOB)1926  0.0098742  0.0297096   0.332   0.7396 
## factor(YOB)1927  0.0194091  0.0323882   0.599   0.5490 
## factor(YOB)1928  0.0311071  0.0353367   0.880   0.3787 
## factor(YOB)1929  0.0247372  0.0390484   0.634   0.5264 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.593 on 247186 degrees of freedom
## Multiple R-squared:  0.1711, Adjusted R-squared:  0.171 
## F-statistic:  4251 on 12 and 247186 DF,  p-value: < 2.2e-16
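
Because the dependent variable is the log of weekly wages, the EDUC coefficient can be read as an approximate percentage change. A minimal sketch of the exact conversion, with the coefficient copied by hand from the output above (the variable names are my own):

```r
# EDUC coefficient copied from the OLS output above
b_educ <- 0.0801676

# Exact percentage change in weekly wages for one more year of schooling
# in a log-linear model: 100 * (exp(b) - 1)
pct_change <- 100 * (exp(b_educ) - 1)
print(round(pct_change, 2))
```

For coefficients this small the exact figure stays close to 100 times the coefficient itself, which is why the shortcut reading is common.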

From the OLS regression estimates, each year of education is associated with roughly 8% higher weekly wages. Now I want to estimate the returns while accounting for the endogenous nature of education. The easiest way to do this is with the “ivreg” function from the “AER” package. The part of the formula to the left of the “|” is the second stage. To the right of the “|” is “. - EDUC + factor(YOBQ)”: this brings the right-hand side of the second stage over, drops the education variable (which tells R that it is endogenous and becomes the dependent variable of the first stage), then adds the instruments.
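
The dot shorthand can be illustrated with base R's update() function, which expands “.” in the same spirit. This is only an illustration of the notation and is not a description of how ivreg processes the formula internally; the object names are my own.

```r
# Right-hand side of the second stage
second_stage_rhs <- ~ EDUC + AGEQ + AGEQSQ + factor(YOB)

# "." stands for the existing right-hand side; "- EDUC" drops the
# endogenous variable and "+ factor(YOBQ)" adds the instruments
instrument_rhs <- update(second_stage_rhs, ~ . - EDUC + factor(YOBQ))

instrument_terms <- attr(terms(instrument_rhs), "term.labels")
print(instrument_terms)
```

What remains is the list of exogenous controls plus the instruments, with EDUC removed.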

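library(AER)

iv_returns <- ivreg(LWKLYWGE ~ EDUC + AGEQ + AGEQSQ + factor(YOB) | . - EDUC + factor(YOBQ),
                    data = wages)

# diagnostics = TRUE is an argument of summary(), not print(); using
# summary(iv_returns, diagnostics = TRUE) would also report the
# weak-instrument, Wu-Hausman, and Sargan tests
print(summary(iv_returns))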

## 
## Call:
## ivreg(formula = LWKLYWGE ~ EDUC + AGEQ + AGEQSQ + factor(YOB) | 
##     . - EDUC + factor(YOBQ), data = wages)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -6.03131 -0.25869  0.04832  0.33655  4.77338 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|) 
## (Intercept)      0.0195429  1.6754613   0.012   0.9907 
## EDUC             0.1310912  0.0333305   3.933 8.39e-05 ***
## AGEQ             0.1409116  0.0703932   2.002   0.0453 * 
## AGEQSQ          -0.0013603  0.0007874  -1.728   0.0841 . 
## factor(YOB)1921  0.0052175  0.0105739   0.493   0.6217 
## factor(YOB)1922  0.0095534  0.0204935   0.466   0.6411 
## factor(YOB)1923  0.0195442  0.0279470   0.699   0.4843 
## factor(YOB)1924  0.0327911  0.0355724   0.922   0.3566 
## factor(YOB)1925  0.0560697  0.0445070   1.260   0.2077 
## factor(YOB)1926  0.0707575  0.0504361   1.403   0.1606 
## factor(YOB)1927  0.0946631  0.0596822   1.586   0.1127 
## factor(YOB)1928  0.1138412  0.0654557   1.739   0.0820 . 
## factor(YOB)1929  0.1134914  0.0708922   1.601   0.1094 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6171 on 247186 degrees of freedom
## Multiple R-Squared: 0.1022,  Adjusted R-squared: 0.1021 
## Wald test: 8.675 on 12 and 247186 DF,  p-value: < 2.2e-16
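
To make the two stages concrete, here is a minimal sketch of two-stage least squares done by hand with lm() on simulated data where the true effect is known. All variable names and parameter values are hypothetical, and this is not the Angrist and Krueger data.

```r
set.seed(42)
n <- 5000

# Simulated data: names and parameter values are hypothetical
z <- rnorm(n)                    # instrument
u <- rnorm(n)                    # unobserved confounder
x <- 0.8 * z + u + rnorm(n)      # endogenous regressor (depends on u)
y <- 1 + 0.5 * x + u + rnorm(n)  # outcome; the true effect of x is 0.5

# OLS is biased because x is correlated with the error term through u
ols_fit <- lm(y ~ x)

# Stage 1: regress the endogenous variable on the instrument
stage1 <- lm(x ~ z)
x_hat <- fitted(stage1)

# Stage 2: regress the outcome on the fitted values from stage 1
stage2 <- lm(y ~ x_hat)

estimates <- c(ols = unname(coef(ols_fit)["x"]),
               tsls = unname(coef(stage2)["x_hat"]))
print(estimates)
```

The hand-rolled second stage reproduces the 2SLS point estimate, but its reported standard errors are not the correct 2SLS standard errors, which is one reason to use ivreg in practice.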

The IV estimate of the returns to education is higher than the OLS estimate and matches Angrist and Krueger (1991). Now that we have both IV and OLS estimates, the goal is to ensure that IV is the correct choice. First, we have a convincing instrument: birth quarter, which for the early and middle part of the 20th century forced those born at the end of the year into an extra year of schooling, whereas those born at the beginning of the year could choose between wage-earning activities and school. Next, we need to ask whether the instruments are relevant, meaning that they are strongly correlated with the endogenous variable. To test relevance, we need the first-stage estimates with and without the instruments. Both are obtained with simple OLS regressions and compared with an F-test:

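# First stage with the instruments
stage_one <- lm(EDUC ~ AGEQ + AGEQSQ + factor(YOB) + factor(YOBQ),
                data = wages)

# First stage without the instruments
stage_one_noiv <- lm(EDUC ~ AGEQ + AGEQSQ + factor(YOB),
                     data = wages)

# F-test of the instruments' joint significance
rel_test <- anova(stage_one, stage_one_noiv, test = 'F')
print(rel_test)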

## Analysis of Variance Table
## 
## Model 1: EDUC ~ AGEQ + AGEQSQ + factor(YOB) + factor(YOBQ)
## Model 2: EDUC ~ AGEQ + AGEQSQ + factor(YOB)
##   Res.Df     RSS  Df Sum of Sq     F Pr(>F)
## 1 247158 2785343 
## 2 247187 2785685 -29   -342.82 1.049 0.3933

The p-value is very large, so we fail to reject the null hypothesis that the instruments are irrelevant. If we had found that the instruments are relevant, we would next want to test for validity. For an instrument to be valid, it must be uncorrelated with the regression error term, which is the problem we were trying to correct in the first place. To test validity, a Sargan test is employed. This test is done with an OLS regression: the residuals from the second-stage IV regression enter as the dependent variable, with the exogenous variables and instruments as the independent variables. The test statistic is the number of observations times the R-squared of this regression. The null hypothesis here is that the instruments are valid.

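# Auxiliary regression of the IV residuals on the exogenous variables and instruments
sargan <- lm(iv_returns$residuals ~ AGEQ + AGEQSQ + factor(YOB) + factor(YOBQ),
             data = wages)

sargan_sum <- summary(sargan)

# Sargan statistic: number of observations times the auxiliary R-squared
sargan_test <- sargan_sum$r.squared * nrow(wages)

# The second argument of pchisq() is the degrees of freedom. It is kept at 1 to
# match the output below; the standard choice is the number of instruments
# minus the number of endogenous regressors.
print(1 - pchisq(sargan_test, 1))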

## [1] 4.228102e-07

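We have an extremely small p-value, so we reject the null hypothesis and conclude that our instruments are invalid. This result should not be surprising given the results of the test for relevance. One caveat: the p-value above uses one degree of freedom, whereas the Sargan statistic is normally compared with a chi-squared distribution whose degrees of freedom equal the number of instruments minus the number of endogenous regressors. That number is larger than one here, which would raise the p-value, so the rejection is best treated as provisional.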
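
The relevance and validity checks can be sketched end to end on simulated data with three instruments that are valid by construction. Everything here (names, coefficients, sample size) is hypothetical. The degrees of freedom for the Sargan statistic are set to the number of instruments minus the number of endogenous regressors. With exogenous controls in the model, those controls would also go into both the first stage and the auxiliary regression.

```r
set.seed(1)
n <- 5000

# Simulated data with three valid instruments; names and values are hypothetical
z1 <- rnorm(n); z2 <- rnorm(n); z3 <- rnorm(n)
u  <- rnorm(n)                                   # unobserved confounder
x  <- 0.5 * z1 + 0.5 * z2 + 0.5 * z3 + u + rnorm(n)
y  <- 1 + 0.5 * x + u + rnorm(n)

# Relevance: F-test of the first stage with and without the instruments
first_stage      <- lm(x ~ z1 + z2 + z3)
first_stage_noiv <- lm(x ~ 1)
rel <- anova(first_stage_noiv, first_stage, test = "F")
f_stat <- rel$F[2]

# 2SLS point estimates by hand, then residuals computed with the ORIGINAL x
x_hat <- fitted(first_stage)
b <- coef(lm(y ~ x_hat))
iv_resid <- y - b[1] - b[2] * x

# Sargan: n times the R-squared from regressing the IV residuals on the instruments
aux <- lm(iv_resid ~ z1 + z2 + z3)
sargan_stat <- n * summary(aux)$r.squared

# Degrees of freedom: 3 instruments minus 1 endogenous regressor
sargan_df <- 3 - 1
sargan_p <- pchisq(sargan_stat, df = sargan_df, lower.tail = FALSE)

print(c(first_stage_F = f_stat, sargan = sargan_stat, p_value = sargan_p))
```

A large first-stage F-statistic indicates relevant instruments, and a Sargan p-value that is not small means the data give no evidence against the overidentifying restrictions.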

This demonstration was based on a paper published in 1991 that has since been the subject of several publications questioning the appropriateness of the chosen instrument. It also shows how hard selecting an instrument can be. Even so, it illustrates how to implement an IV regression in R and how to run the appropriate tests on it.

Reference

Angrist, J. D., & Krueger, A. B. (1991). Does Compulsory School Attendance Affect Schooling and Earnings? The Quarterly Journal of Economics, 106(4), 979–1014. https://doi.org/10.2307/2937954