library(tidyverse)

# You may need to change the file path (this one is for Macs)
mortgage <- read_csv("~/Downloads/county_06065.csv") %>%
  # Keep applications that were originated (1), approved but not accepted (2), or denied (3)
  filter(action_taken %in% c(1, 2, 3)) %>%
  mutate(
    age = factor(applicant_age,
                 levels = c("<25", "25-34", "35-44",
                            "45-54", "55-64", "65-74", ">74", "8888")),
    # Treat the 8888 code as missing
    age = if_else(age == "8888", NA, age),
    # Binary outcome: 1 if the application was approved, 0 if it was denied
    approved = if_else(action_taken %in% c(1, 2), 1, 0)
  ) %>%
  mutate(
    # Loan amount in thousands of dollars
    loan_amount = loan_amount/1000,
    derived_race = factor(derived_race),
    # Replace each debt-to-income bracket with a single representative number;
    # values reported as whole percentages (36 to 49) are kept as they are
    debt_to_income_ratio = case_when(
      debt_to_income_ratio == "20%-<30%" ~ 25,
      debt_to_income_ratio == "30%-<36%" ~ 33,
      debt_to_income_ratio == "50%-60%" ~ 55,
      debt_to_income_ratio == "<20%" ~ 15,
      debt_to_income_ratio == ">60%" ~ 65,
      debt_to_income_ratio %in% as.character(36:49) ~ as.numeric(debt_to_income_ratio),
      .default = NA
    ),
    # Non-numeric entries in these columns become NA
    loan_term = as.numeric(loan_term),
    property_value = as.numeric(property_value),
    interest_rate = as.numeric(interest_rate)
  ) %>%
  select(income, age, derived_race,
         loan_amount, interest_rate, property_value,
         debt_to_income_ratio, loan_term, approved,
         tract_minority_population_percent,
         tract_median_age_of_housing_units)
5 Linear Probability Model
Today, we’ll use lm() to estimate a linear probability model to predict whether someone’s mortgage application gets approved.
Because the dependent variable is binary (approved vs. not approved), the fitted values from the model can be interpreted as predicted probabilities.
Goals:
- Understand what the coefficients in a linear probability model mean
- Use the model to make predictions for specific applicants
- Identify the key limitation of the linear probability model
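To make the "fitted values are probabilities" idea concrete, here is a minimal sketch on a tiny made-up data set (the `toy` data frame and its `high_income` indicator are hypothetical and do not come from the mortgage file). With a single 0/1 regressor, the fitted values from lm() should simply reproduce the approval rate within each group.

```r
# Toy data (hypothetical, not from the HMDA file): 8 applicants,
# a 0/1 outcome and a 0/1 indicator for "high income".
toy <- data.frame(
  approved    = c(1, 1, 1, 0, 1, 0, 0, 0),
  high_income = c(1, 1, 1, 1, 0, 0, 0, 0)
)

fit <- lm(approved ~ high_income, data = toy)

# Intercept = approval rate when high_income == 0 (1 of 4 = 0.25)
# Slope     = difference in approval rates (0.75 - 0.25 = 0.50)
coef(fit)

# Approval rate within each group, computed directly
group_rates <- tapply(toy$approved, toy$high_income, mean)
group_rates

# The fitted values take only two distinct values: the two group rates
sort(unique(round(fitted(fit), 2)))
```

With a continuous regressor such as income the fitted values are no longer group averages, but they are read the same way: as the estimated probability of approval for an applicant with those characteristics.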
Step 0: Download the Data
Download and unzip 2024 mortgage data for Riverside County, CA:
https://ffiec.cfpb.gov/data-browser/data/2024?category=counties&items=06065
Leave steps 2 and 3 blank (we want all financial institutions and we don’t want to filter).
Then take a look at the data dictionary for the data you downloaded:
https://ffiec.cfpb.gov/documentation/publications/loan-level-datasets/lar-data-fields
Step 1: Data Exploration
Outcome variable: what should the dependent variable be in our prediction model? How is it constructed?
Pick one variable you think will strongly predict approval. Plot its distribution.
- Describe how your chosen variable relates to approval, using dplyr to summarize the relationship and ggplot2 to visualize it. Add a line of best fit using geom_smooth(method = lm).
- Find applicants most similar to you: how many of those people were approved? Now change one variable. Does the approval rate change a lot?
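The exploration steps above can be sketched on simulated stand-in data. Everything in this block is an assumption made for illustration: the `toy` tibble, the income bins, and the 90 to 110 "similar to you" window are invented, so swap in `mortgage` and a variable of your choice when you do the exercise.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in data (hypothetical): income in $1000s and a 0/1 outcome
set.seed(1)
toy <- tibble(
  income   = round(runif(500, min = 20, max = 300)),
  approved = rbinom(500, size = 1, prob = 0.5 + 0.001 * income)
)

# Approval rate within income bins: the mean of a 0/1 variable is a share
rates <- toy %>%
  mutate(income_bin = cut(income, breaks = c(0, 100, 200, 300))) %>%
  group_by(income_bin) %>%
  summarize(n = n(), approval_rate = mean(approved))
rates

# "Applicants similar to you": filter to a narrow window and recompute the rate
similar <- toy %>%
  filter(between(income, 90, 110)) %>%
  summarize(n = n(), approval_rate = mean(approved))
similar

# The 0/1 outcome against income, with a line of best fit
p <- ggplot(toy, aes(x = income, y = approved)) +
  geom_jitter(height = 0.05, alpha = 0.3) +
  geom_smooth(method = lm)
p
```

The jitter is only there so that overlapping 0/1 points are visible; the fitted line is computed from the unjittered data.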
Step 2: Estimating the Linear Probability Model
- Interpret this baseline model:
mortgage %>%
  lm(approved ~ income, data = .) %>%
  broom::tidy()
Interpretation:
- A person’s probability of being approved when their income is zero is estimated to be ___%.
- When income increases by 1 unit ($1000), your chance of being approved increases by ___, which is statistically different from zero at the ___% level.
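As a sketch of how the two coefficients turn into a prediction for a specific applicant, the block below fits the same kind of model on simulated data (the `toy` data frame and the income value of 80 are hypothetical). It computes the prediction by hand from the intercept and slope, then checks it against predict().

```r
# Simulated stand-in data (hypothetical): income in $1000s and a 0/1 outcome
set.seed(2)
toy <- data.frame(income = round(runif(400, min = 20, max = 300)))
toy$approved <- rbinom(400, size = 1, prob = 0.5 + 0.001 * toy$income)

fit <- lm(approved ~ income, data = toy)
b <- coef(fit)

# Slope expressed in percentage points per extra $1000 of income
slope_pp <- 100 * b[["income"]]
slope_pp

# Predicted approval probability for an applicant with income = 80 ($80,000)
by_hand    <- b[["(Intercept)"]] + b[["income"]] * 80
by_predict <- predict(fit, newdata = data.frame(income = 80))
c(by_hand = by_hand, by_predict = unname(by_predict))
```

The numbers in `b` are the same ones broom::tidy() reports in its `estimate` column.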
- Add loan_amount and the interaction between loan_amount and income to the model by estimating approved ~ income*loan_amount, and interpret your results.
mortgage %>%
  lm(approved ~ income*loan_amount, data = .) %>%
  broom::tidy()
Interpretation:
- A person’s probability of being approved when their income is zero and their loan amount is zero is estimated to be ___.
- The higher your income, the (higher/lower) the chance you are approved.
- The higher your loan amount, the (higher/lower) the chance you are approved.
- But the sign of the coefficient on the interaction indicates that income and loan amount are (complements/substitutes): Higher income increases approval probability, but less so for larger loans.
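With an interaction in the model, the effect of income is no longer a single number: it equals the income coefficient plus the interaction coefficient times the loan amount. The sketch below illustrates this on simulated data; the `toy` data frame, the probabilities used to generate it, and the helper `income_effect()` are all hypothetical.

```r
# Simulated stand-in data (hypothetical): both variables in $1000s
set.seed(3)
n <- 1000
toy <- data.frame(
  income      = runif(n, min = 20, max = 300),
  loan_amount = runif(n, min = 50, max = 800)
)
p_true <- 0.6 + 0.001 * toy$income - 0.000002 * toy$income * toy$loan_amount
toy$approved <- rbinom(n, size = 1, prob = p_true)

fit <- lm(approved ~ income * loan_amount, data = toy)
b <- coef(fit)

# Effect of one more unit of income on the predicted probability,
# evaluated at a given loan amount
income_effect <- function(loan) {
  b[["income"]] + b[["income:loan_amount"]] * loan
}
income_effect(c(100, 400, 700))

# Same quantity as a difference between two predictions at loan_amount = 400
new <- data.frame(income = c(100, 101), loan_amount = c(400, 400))
diff(predict(fit, newdata = new))
```

Comparing `income_effect()` across small and large loans is one way to read the sign of the interaction coefficient.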
- Include all variables in a large model of approved. Are any coefficient signs surprising?
mortgage %>%
  lm(approved ~ income + age + derived_race + loan_amount +
       property_value + debt_to_income_ratio + loan_term +
       tract_minority_population_percent +
       tract_median_age_of_housing_units, data = .) %>%
  broom::tidy() %>%
  view()
- Add the model predictions to the data set using the function fitted.values, which takes an lm object. Are any probabilities invalid (<0 or >1)?
mortgage %>%
  select(___) %>%
  drop_na() %>%
  mutate(prediction = fitted.values(lm(____, data = .)))
Find the 10 people with the highest predicted probability of approval. Why are their predictions so high?
Find the 10 people with the lowest predicted probability of approval. Why are their predictions so low?
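Here is a minimal sketch of the prediction workflow on simulated data (the `toy` tibble, its two predictors, and the injected missing values are hypothetical). It shows why drop_na() comes first: lm() silently drops rows with missing values, so without it the vector of fitted values would be shorter than the data. It then counts predictions outside [0, 1] and pulls the highest and lowest 10 using dplyr's slice_max() and slice_min().

```r
library(dplyr)
library(tidyr)

# Simulated stand-in data (hypothetical), with some missing incomes
set.seed(4)
toy <- tibble(
  income      = runif(300, min = 20, max = 2000),
  loan_amount = runif(300, min = 50, max = 800)
) %>%
  mutate(
    approved = rbinom(n(), size = 1, prob = pmin(0.5 + 0.0004 * income, 0.95)),
    income   = replace(income, 1:10, NA)
  )

# lm() drops incomplete rows, so fitted values and data differ in length
n_fitted <- length(fitted.values(lm(approved ~ income + loan_amount, data = toy)))
c(rows = nrow(toy), fitted = n_fitted)

# Drop incomplete rows first so the fitted values line up with the data
scored <- toy %>%
  select(approved, income, loan_amount) %>%
  drop_na() %>%
  mutate(prediction = fitted.values(lm(approved ~ income + loan_amount, data = .)))

# Count predictions that are not valid probabilities
scored %>%
  summarize(below_0 = sum(prediction < 0), above_1 = sum(prediction > 1))

# Highest and lowest 10 predicted probabilities
top10    <- scored %>% slice_max(prediction, n = 10)
bottom10 <- scored %>% slice_min(prediction, n = 10)
top10
bottom10
```

Nothing in lm() constrains predictions to lie between 0 and 1, so extreme values of a predictor can push the fitted line outside that range.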
Write down some notes about issues with the data that we should resolve before taking our results seriously. What variables have values that don’t make sense?
Over the next few weeks, among other things, we’ll develop a theory of how to evaluate prediction quality when it comes to this mortgage problem. Write down your ideas: what kinds of metrics and procedures could work?