5 Linear Probability Model

Today, we’ll use lm() to estimate a linear probability model to predict whether someone’s mortgage application gets approved.

Because the dependent variable is binary (approved vs. not approved), the fitted values from the model can be interpreted as predicted probabilities.
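
To see why the fitted values can be read as probabilities, here is a minimal sketch on simulated data. The variables `high_income` and `approved_sim` are hypothetical and do not come from the mortgage file. With a single binary regressor, the fitted values from `lm()` should equal the approval rate within each group.

```r
# Simulated data: hypothetical variables, not from the mortgage file
set.seed(1)
n <- 1000
high_income <- rbinom(n, 1, 0.5)

# Assumed approval probability: 0.6 for low income, 0.9 for high income
approved_sim <- rbinom(n, 1, 0.6 + 0.3 * high_income)

fit <- lm(approved_sim ~ high_income)

# Approval rate (mean of the 0/1 outcome) within each group
group_rates <- tapply(approved_sim, high_income, mean)

# Average fitted value within each group
fitted_by_group <- tapply(fitted(fit), high_income, mean)

print(group_rates)
print(fitted_by_group)
```

The two printed vectors should match, because the mean of a 0/1 variable is the share of ones.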

Goals:

Understand what the coefficients in a linear probability model mean
Use the model to make predictions for specific applicants
Identify the key limitation of the linear probability model

Step 0: Download the Data

Download and unzip 2024 mortgage data for Riverside County, CA:

https://ffiec.cfpb.gov/data-browser/data/2024?category=counties&items=06065

Leave steps 2 and 3 blank (we want all financial institutions and we don’t want to filter).

Then take a look at the data dictionary for the data you downloaded:

https://ffiec.cfpb.gov/documentation/publications/loan-level-datasets/lar-data-fields

library(tidyverse)

# You may need to change the file path (this one is for macs)
mortgage <- read_csv("~/Downloads/county_06065.csv") %>%
  filter(action_taken %in% c(1, 2, 3)) %>%
  mutate(age = factor(applicant_age, 
               levels = c("<25", "25-34", "35-44",
               "45-54", "55-64", "65-74", ">74", "8888")),
         age = if_else(age == "8888", NA, age),
         approved = if_else(action_taken %in% c(1, 2), 1, 0)
         ) %>%
  mutate(
    loan_amount = loan_amount/1000, 
    derived_race = factor(derived_race), 
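    # Recode the reported debt-to-income categories to representative numbers
    debt_to_income_ratio = case_when(
      debt_to_income_ratio == "<20%" ~ 15,
      debt_to_income_ratio == "20%-<30%" ~ 25,
      debt_to_income_ratio == "30%-<36%" ~ 33,
      # Values from 36 to 49 are reported as whole percentages, so keep them;
      # suppressWarnings() hides the coercion warning for the category labels
      debt_to_income_ratio %in% as.character(36:49) ~
        suppressWarnings(as.numeric(debt_to_income_ratio)),
      debt_to_income_ratio == "50%-60%" ~ 55,
      debt_to_income_ratio == ">60%" ~ 65,
      .default = NA  # anything else becomes NA
    ),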
    loan_term = as.numeric(loan_term),
    property_value = as.numeric(property_value),
    interest_rate = as.numeric(interest_rate)
    ) %>%
  select(income, age, derived_race, 
    loan_amount, interest_rate, property_value,
    debt_to_income_ratio, loan_term, approved,
    tract_minority_population_percent,
    tract_median_age_of_housing_units)
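
The debt-to-income recoding above can be checked on a small example. This is a sketch on a made-up vector, `dti_raw`, whose values are assumptions chosen to cover each branch; it applies the same `case_when()` logic outside of `mutate()`.

```r
library(dplyr)

# Hypothetical raw values, one per branch plus two that should become NA
dti_raw <- c("<20%", "20%-<30%", "30%-<36%", "42", "50%-60%", ">60%", "other", NA)

dti_num <- case_when(
  dti_raw == "<20%" ~ 15,
  dti_raw == "20%-<30%" ~ 25,
  dti_raw == "30%-<36%" ~ 33,
  # as.numeric() warns on the category labels, so silence the warning here
  dti_raw %in% as.character(36:49) ~ suppressWarnings(as.numeric(dti_raw)),
  dti_raw == "50%-60%" ~ 55,
  dti_raw == ">60%" ~ 65,
  .default = NA
)

print(dti_num)
```

Note that the categories are replaced by a single number each, so the recoded variable is only an approximation of the applicant's true ratio.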

Step 1: Data Exploration

Outcome variable: what should the dependent variable be in our prediction model? How is it constructed?
Pick one variable you think will strongly predict approval. Plot its distribution.

Describe how your chosen variable relates to approval, using dplyr to summarize the relationship and also using ggplot2 to visualize it. Add a line of best fit using geom_smooth(method = lm).
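
One possible shape for this summary and plot, sketched on simulated data rather than the mortgage file: the tibble `sim`, its column `dti`, and the assumed relationship between `dti` and approval are all made up for illustration.

```r
library(tidyverse)

# Simulated data: approval probability is assumed to fall as dti rises
set.seed(2)
sim <- tibble(
  dti = sample(c(15, 25, 33, 40, 45, 55, 65), 2000, replace = TRUE),
  approved = rbinom(2000, 1, 1.1 - 0.01 * dti)
)

# Approval rate and group size at each value of the predictor
rates <- sim %>%
  group_by(dti) %>%
  summarize(approval_rate = mean(approved), n = n())

print(rates)

# Jittered 0/1 outcomes with a line of best fit
p <- ggplot(sim, aes(x = dti, y = approved)) +
  geom_jitter(height = 0.05, alpha = 0.1) +
  geom_smooth(method = lm)

print(p)
```

Because the outcome only takes the values 0 and 1, jittering the points vertically makes the density of approvals and denials easier to see.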

Find applicants most similar to you: how many of those people were approved? Now change one variable. Does the approval rate change a lot?

Step 2: Estimating the Linear Probability Model

Interpret this baseline model:

mortgage %>%
  lm(approved ~ income, data = .) %>%
  broom::tidy()

Interpretation:

A person’s probability of being approved when their income is zero is estimated to be ___%.
When income increases by 1 unit ($1000), your chance of being approved increases by ___, which is statistically different from zero at the ___% level.
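
As a guide to filling in these blanks, here is a sketch of how the intercept and slope convert to percentages. It uses simulated data with an assumed intercept of 0.5 and an assumed slope of 0.002; `income_sim` and `approved_sim` are hypothetical names, and the numbers say nothing about the real mortgage data.

```r
# Simulated data: income in thousands of dollars, as in the worksheet
set.seed(3)
n <- 5000
income_sim <- runif(n, 20, 200)
approved_sim <- rbinom(n, 1, 0.5 + 0.002 * income_sim)

fit <- lm(approved_sim ~ income_sim)
b <- coef(fit)

# Intercept: predicted probability at zero income, as a percent
intercept_pct <- 100 * b[["(Intercept)"]]

# Slope: change in probability per unit of income, in percentage points
slope_pp <- 100 * b[["income_sim"]]

# p-value for the slope, to compare against a chosen significance level
p_value <- summary(fit)$coefficients["income_sim", "Pr(>|t|)"]

cat(sprintf("Predicted approval probability at zero income: %.1f%%\n", intercept_pct))
cat(sprintf("Change per extra $1000 of income: %.2f percentage points\n", slope_pp))
cat(sprintf("p-value on income: %.3g\n", p_value))
```

Multiplying a coefficient by 100 turns a change in probability into percentage points, which is usually the easier unit to report.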

Add loan_amount and the interaction between loan_amount and income to the model by estimating approved ~ income*loan_amount and interpret your results.

mortgage %>%
  lm(approved ~ income*loan_amount, data = .) %>%
  broom::tidy()

Interpretation:

A person’s probability of being approved when their income is zero and their loan amount is zero is estimated to be ___.
The higher your income, the (higher/lower) the chance you are approved.
The higher your loan amount, the (higher/lower) the chance you are approved.
But the sign of the coefficient on the interaction indicates that income and loan amount are (complements/substitutes): Higher income increases approval probability, but less so for larger loans.
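
With an interaction, the effect of income is no longer a single number: it equals the income coefficient plus the interaction coefficient times the loan amount. The sketch below illustrates this on simulated data. All names ending in `_sim` are hypothetical, and the coefficient values in the simulation are assumptions, not estimates from the mortgage data.

```r
# Simulated data with an assumed negative interaction
set.seed(4)
n <- 20000
income_sim <- runif(n, 20, 200)   # thousands of dollars
loan_sim <- runif(n, 100, 800)    # thousands of dollars
p_true <- 0.4 + 0.003 * income_sim + 0.0002 * loan_sim -
  0.000003 * income_sim * loan_sim
approved_sim <- rbinom(n, 1, p_true)

fit <- lm(approved_sim ~ income_sim * loan_sim)
b <- coef(fit)

# Effect of one more unit of income on approval probability at a given loan amount
marginal_income <- function(loan) {
  b[["income_sim"]] + b[["income_sim:loan_sim"]] * loan
}

print(marginal_income(c(100, 400, 800)))
```

When the interaction coefficient is negative, the printed effects shrink as the loan amount grows.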

Include all variables in a large model of approved. Are any coefficient signs surprising?

mortgage %>%
  # The outcome (approved) belongs only on the left-hand side of the formula
  lm(approved ~ income + age + derived_race + loan_amount +
       property_value + debt_to_income_ratio + loan_term +
       tract_minority_population_percent +
       tract_median_age_of_housing_units, data = .) %>%
  broom::tidy() %>%
  view()

Add the model predictions to the data set using the function fitted.values, which takes an lm object. Are any probabilities invalid (<0 or >1)?

mortgage %>%
  select(___) %>%
  drop_na() %>%
  mutate(prediction = fitted.values(lm(____, data = .)))
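
To show what an invalid prediction looks like, here is a sketch on simulated data. The predictor `x` and outcome `approved_sim` are hypothetical, and the true probabilities are assumed to follow an S-shaped curve, which a straight line cannot match at the extremes.

```r
# Simulated data: true approval probability follows a logistic curve in x
set.seed(5)
n <- 2000
x <- rnorm(n)
approved_sim <- rbinom(n, 1, plogis(2 * x))

fit <- lm(approved_sim ~ x)
pred <- fitted.values(fit)

# Range of predictions and the share that are not valid probabilities
print(range(pred))
share_invalid <- mean(pred < 0 | pred > 1)
print(share_invalid)
```

A straight line keeps rising or falling without bound, so for extreme values of the predictor it can leave the interval from 0 to 1.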

Find the 10 people with the highest predicted probability of approval. Why are their predictions so high?
Find the 10 people with the lowest predicted probability of approval. Why are their predictions so low?
Write down some notes about issues with the data that we should resolve before taking our results seriously. What variables have values that don’t make sense?
Over the next few weeks, among other things, we’ll develop a theory of how to evaluate prediction quality when it comes to this mortgage problem. Write down your ideas: what kinds of metrics and procedures could work?
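
For the highest and lowest predictions asked about above, `slice_max()` and `slice_min()` from dplyr are one option. The sketch below uses a made-up tibble `sim` with a hypothetical `prediction` column; with the real data you would apply the same calls to the data set that holds your fitted values.

```r
library(dplyr)

# Simulated data with a hypothetical prediction column
set.seed(6)
sim <- tibble(
  id = 1:200,
  income = runif(200, 20, 200),
  prediction = 0.5 + 0.002 * income + rnorm(200, 0, 0.05)
)

# The 10 rows with the highest and the lowest predicted probability
top10 <- sim %>% slice_max(prediction, n = 10)
bottom10 <- sim %>% slice_min(prediction, n = 10)

print(top10)
print(bottom10)
```

Looking at the other columns of these rows side by side is a quick way to see which variables drive the extreme predictions.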

Download this assignment

Here’s a link to download this assignment.