---
title: "5 Linear Probability Model"
format:
  html:
    self-contained: TRUE
---

Today, we'll use `lm()` to estimate a **linear probability model** to predict whether someone's mortgage application gets approved. 

Because the dependent variable is binary (approved vs. not approved), the fitted values from the model can be interpreted as **predicted probabilities**.
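To see why fitted values act like probabilities, here is a minimal sketch on simulated data (not the mortgage file). The data frame `toy`, the dummy `high_income`, and the approval rates 0.6 and 0.9 are all made up for illustration. With a single 0/1 regressor, the fitted values should equal the share approved within each group.

```{r}
# Simulated data (hypothetical, not the mortgage file):
# a binary outcome and one binary predictor
set.seed(1)
toy <- data.frame(high_income = rep(c(0, 1), each = 500))
toy$approved <- rbinom(1000, size = 1,
                       prob = ifelse(toy$high_income == 1, 0.9, 0.6))

fit <- lm(approved ~ high_income, data = toy)

# Share approved within each group
group_means <- tapply(toy$approved, toy$high_income, mean)

# Fitted values within each group (constant within a group, so the mean is that value)
fitted_by_group <- tapply(fitted.values(fit), toy$high_income, mean)

print(group_means)
print(fitted_by_group)
```

In this sketch the intercept is the approval share when `high_income` is 0, and the slope is the difference in approval shares between the two groups.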

Goals:

- Understand what the coefficients in a linear probability model mean
- Use the model to make predictions for specific applicants
- Identify the key limitation of the linear probability model

## Step 0: Download the Data

Download and unzip 2024 mortgage data for Riverside County, CA:

https://ffiec.cfpb.gov/data-browser/data/2024?category=counties&items=06065

Leave steps 2 and 3 blank (we want all financial institutions and we don't want to filter).

Then take a look at the data dictionary for the data you downloaded: 

https://ffiec.cfpb.gov/documentation/publications/loan-level-datasets/lar-data-fields

```{r}
library(tidyverse)

# You may need to change the file path (this one is for Macs)
mortgage <- read_csv("~/Downloads/county_06065.csv") %>%
  filter(action_taken %in% c(1, 2, 3)) %>%
  mutate(age = factor(applicant_age, 
               levels = c("<25", "25-34", "35-44",
               "45-54", "55-64", "65-74", ">74")),
         # Codes outside these levels, such as "8888" (not applicable), become NA
         approved = if_else(action_taken %in% c(1, 2), 1, 0)
         ) %>%
  mutate(
    loan_amount = loan_amount/1000, 
    derived_race = factor(derived_race), 
    debt_to_income_ratio = case_when(
      debt_to_income_ratio == "20%-<30%" ~ 25,
      debt_to_income_ratio == "30%-<36%" ~ 33,
      debt_to_income_ratio == "50%-60%" ~ 55,
      debt_to_income_ratio == "<20%" ~ 15,
      debt_to_income_ratio == ">60%" ~ 65,
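      # Values 36 through 49 are reported as exact percentages; suppressWarnings()
      # hides the coercion warnings triggered by the range labels handled above
      debt_to_income_ratio %in% as.character(36:49) ~
        suppressWarnings(as.numeric(debt_to_income_ratio)),
      .default = NA
    ),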
    loan_term = as.numeric(loan_term),
    property_value = as.numeric(property_value),
    interest_rate = as.numeric(interest_rate)
    ) %>%
  select(income, age, derived_race, 
    loan_amount, interest_rate, property_value,
    debt_to_income_ratio, loan_term, approved,
    tract_minority_population_percent,
    tract_median_age_of_housing_units)
```
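The `debt_to_income_ratio` recoding above is the least obvious step, so here is a small sketch of the same mapping applied to a toy vector. The vector `dti_raw` is hypothetical; it simply lists the category labels used in the chunk above plus two values that should end up missing.

```{r}
library(dplyr)

# Hypothetical toy vector with the labels handled in the chunk above
dti_raw <- c("<20%", "20%-<30%", "30%-<36%", "36", "49",
             "50%-60%", ">60%", "Exempt", NA)

dti_num <- case_when(
  dti_raw == "<20%"     ~ 15,   # representative value for the open-ended bottom bin
  dti_raw == "20%-<30%" ~ 25,   # midpoint of the range
  dti_raw == "30%-<36%" ~ 33,
  dti_raw == "50%-60%"  ~ 55,
  dti_raw == ">60%"     ~ 65,   # representative value for the open-ended top bin
  # Exact percentages pass through as numbers
  dti_raw %in% as.character(36:49) ~ suppressWarnings(as.numeric(dti_raw)),
  # Anything else (for example "Exempt" or a missing value) becomes NA
  .default = NA
)

print(data.frame(dti_raw, dti_num))
```

Note that the values chosen for the ranges are approximations, so the recoded variable is only roughly continuous.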

## Step 1: Data Exploration

a) Outcome variable: what should the dependent variable be in our prediction model? How is it constructed?

b) Pick one variable you think will strongly predict approval. Plot its distribution.

```{r}

```

c) Describe how your chosen variable relates to approval, using dplyr to summarize the relationship and also using ggplot2 to visualize it. Add a line of best fit using `geom_smooth(method = lm)`.

```{r}

```

d) Find applicants most similar to you: how many of those people were approved? Now change one variable. Does the approval rate change a lot?

```{r}

```

## Step 2: Estimating the Linear Probability Model

a) Interpret this baseline model:

```{r}
mortgage %>%
  lm(approved ~ income, data = .) %>%
  broom::tidy()
```

Interpretation: 

- A person's *probability of being approved when their income is zero* is estimated to be ___%.
- When income increases by 1 unit ($1,000), the estimated probability of being approved increases by ___ percentage points, which is statistically different from zero at the ___% level.
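It may help to write the baseline model as an equation. The symbols $\beta_0$ and $\beta_1$ are notation introduced here for the intercept and the `income` coefficient reported by `broom::tidy()`:

$$
\Pr(\text{approved}_i = 1 \mid \text{income}_i) = \beta_0 + \beta_1 \, \text{income}_i
$$

Setting $\text{income}_i = 0$ leaves only $\beta_0$, which is why the intercept is the predicted probability at zero income. Raising income by one unit changes the predicted probability by $\beta_1$, or $100 \times \beta_1$ percentage points.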

b) Add `loan_amount` and the interaction between `loan_amount` and `income` to the model by estimating `approved ~ income*loan_amount` and interpret your results.

```{r}
mortgage %>%
  lm(approved ~ income*loan_amount, data = .) %>%
  broom::tidy()
```

Interpretation:

- A person's *probability of being approved when their income is zero and their loan amount is zero* is estimated to be ___.
- The higher your income, the (higher/lower) the chance you are approved.
- The higher your loan amount, the (higher/lower) the chance you are approved.
- But the sign of the coefficient on the interaction indicates that income and loan amount are (complements/substitutes): Higher income increases approval probability, but less so for larger loans.
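The interaction model can be written out the same way. The symbols $\beta_0, \dots, \beta_3$ are notation introduced here, and $\text{loan}_i$ stands for `loan_amount`:

$$
\Pr(\text{approved}_i = 1) = \beta_0 + \beta_1 \, \text{income}_i + \beta_2 \, \text{loan}_i + \beta_3 \, (\text{income}_i \times \text{loan}_i)
$$

Differentiating with respect to income gives the effect of one more unit of income:

$$
\frac{\partial \Pr(\text{approved}_i = 1)}{\partial \, \text{income}_i} = \beta_1 + \beta_3 \, \text{loan}_i
$$

So the effect of income is not a single number: it depends on the loan amount, and the sign of $\beta_3$ tells you whether it grows or shrinks as the loan gets larger.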

c) Include all variables in a large model of `approved`. Are any coefficient signs surprising?

```{r}
mortgage %>%
  lm(approved ~ income + age + derived_race + loan_amount +
       property_value + debt_to_income_ratio + loan_term +
       tract_minority_population_percent +
       tract_median_age_of_housing_units, data = .) %>%
  broom::tidy() %>%
  view()
```

d) Add the **model predictions** to the data set using the function `fitted.values`, which takes an lm object. Are any probabilities invalid (<0 or >1)?

```{r}
mortgage %>%
  select(___) %>%
  drop_na() %>%
  mutate(prediction = fitted.values(lm(____, data = .)))
```
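Before filling in the blanks, it can help to see the same check on simulated data. This is a sketch under made-up assumptions: the data frame `toy`, the income range, and the approval rule are all hypothetical. The true approval probability is capped at 1, but the fitted line is not.

```{r}
# Simulated data (hypothetical, not the mortgage file)
set.seed(42)
n <- 2000
toy <- data.frame(income = runif(n, min = 0, max = 500))

# True approval probability rises with income and is capped at 1
toy$approved <- rbinom(n, size = 1,
                       prob = pmin(1, 0.5 + 0.004 * toy$income))

fit <- lm(approved ~ income, data = toy)
toy$prediction <- fitted.values(fit)

# Count fitted "probabilities" outside the valid [0, 1] range
n_invalid <- sum(toy$prediction < 0 | toy$prediction > 1)

print(range(toy$prediction))
print(n_invalid)
```

Because the fitted line keeps rising with income, some predictions for high-income rows land above 1 even though no true probability can.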

e) Find the 10 people with the highest predicted probability of approval. Why are their predictions so high?

f) Find the 10 people with the lowest predicted probability of approval. Why are their predictions so low?

g) Write down some notes about issues with the data that we should resolve before taking our results seriously. What variables have values that don't make sense?

h) Over the next few weeks, among other things, we'll develop a theory of how to evaluate prediction quality when it comes to this mortgage problem. Write down your ideas: what kinds of metrics and procedures could work?
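
As one possible starting point (not the approach the course will necessarily take), here is a sketch that scores a model on rows it was not fitted to. Everything in it is an assumption for illustration: the simulated data frame `toy`, the 300-row holdout, and the 0.5 cutoff for calling a prediction "approved".

```{r}
# Simulated data (hypothetical, not the mortgage file)
set.seed(7)
n <- 1000
toy <- data.frame(income = runif(n, min = 20, max = 300))
toy$approved <- rbinom(n, size = 1,
                       prob = pmin(0.95, 0.4 + 0.002 * toy$income))

# Hold out 300 rows so the model is scored on data it has not seen
test_rows <- sample(n, size = 300)
train <- toy[-test_rows, ]
test  <- toy[test_rows, ]

fit <- lm(approved ~ income, data = train)
p_hat <- predict(fit, newdata = test)

# Share of held-out rows classified correctly with a 0.5 cutoff
accuracy <- mean((p_hat > 0.5) == (test$approved == 1))

# Mean squared error of the predicted probabilities (often called the Brier score)
brier <- mean((p_hat - test$approved)^2)

print(c(accuracy = accuracy, brier = brier))
```

The two numbers answer different questions: accuracy depends on the cutoff you pick, while the squared-error measure rewards predicted probabilities that are close to the observed outcomes.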