cross_validate <- function(data, model) {
  results <- map(
    .x = 1:___,
    .f = function(fold_num) {
      train <- data %>% filter(fold != ___)
      test <- data %>% filter(fold == ___)
      predictions <- if (model == "lpm") {
        predict_lpm(train, test)
      } else if (model == "logit") {
        predict_logit(train, test)
      } else if (model == "knn") {
        predict_knn(train, test, k = 60)
      }
      predictions %>%
        make_roc() %>%
        auc()
    }
  )
  ___(unlist(results))
}
# Test: this should give you a single number for the CV AUC.
mortgage %>%
  select(approved, income, loan_amount) %>%
  mutate(fold = sample(1:5, size = nrow(mortgage), replace = TRUE)) %>%
  cross_validate("lpm")

11 Feature Engineering
Today we’ll focus on what a model sees by building and transforming features (explanatory variables in the model). We’ll also use cross-validation to test whether these choices improve prediction.
To review, here’s what we’ve been doing over the past few classes:
- We introduced K-Nearest Neighbors (KNN), a non-parametric model that predicts an outcome based on the values of the dependent variable for the K closest observations.
- We saw that if we train and test on the same data and let \(K = 1\), the model achieves an AUC of 1. This looked like perfect performance, but it was only because each observation was its own nearest neighbor, so the model simply memorized the data set. This highlighted why we need separate training and test sets. Once we made that change, the LPM, Logit, and KNN all achieved similar performance, with AUCs around 0.76 to 0.78.
- Then we connected this to the bias/variance tradeoff:
- The LPM and Logit are high-bias, low-variance models. They rely on strong functional form assumptions, which introduce bias. However, their predictions are stable across different samples.
- KNN is a low-bias, high-variance model. It makes very few assumptions about the shape of the relationship, which reduces bias. But its predictions can change substantially with small changes in the data, leading to high variance. Because distance matters, we also need to standardize variables so that each feature is on a comparable scale.
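As a quick illustration of that last point, here is a minimal sketch of standardizing features before running KNN. The toy values are made up; the point is that `income` and `loan_amount` live on different raw scales, so unstandardized distances would be dominated by the larger-scale variable:

```r
library(dplyr)

# Toy data (illustrative values, both in thousands of dollars):
toy <- tibble(income = c(50, 100, 200), loan_amount = c(300, 400, 900))

# Rescale every column to mean 0 and standard deviation 1:
standardized <- toy %>%
  mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x)))

standardized
```

After this step, a one-unit difference means "one standard deviation" for every feature, so no single variable dominates the distance calculation.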
This class, we’ll learn about:
- Cross-validation: instead of relying on a single train/test split, we’ll use multiple splits of the data to get a more reliable measure of model performance. Here’s how it works:
- Randomly assign each observation to a “fold” (numbers 1 through 5).
- Save fold 1 to test the model; train the model on the other 4 folds.
- Save fold 2 to test the model; train the model on the other 4 folds.
- Save fold 3 to test the model; train the model on the other 4 folds.
- Save fold 4 to test the model; train the model on the other 4 folds.
- Save fold 5 to test the model; train the model on the other 4 folds.
- Average the 5 resulting AUCs: this becomes the “Cross-Validation AUC” (CV AUC), which will be more trustworthy than the AUC on a single train/test split.
- Feature engineering: We’ll create new variables from the ones we already have, like logs, squares, or ratios, to help our models better capture patterns in the data.
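The fold-assignment step above can be sketched on its own with toy data (the full cross-validation function is the exercise in Part 1):

```r
set.seed(1234)
n <- 20
# Each row is randomly assigned to exactly one of 5 folds:
fold <- sample(1:5, size = n, replace = TRUE)

# For any fold f, the test set is the rows with fold == f and the
# training set is everything else, so each row is tested exactly once:
test_rows_fold1 <- which(fold == 1)
train_rows_fold1 <- which(fold != 1)
length(test_rows_fold1) + length(train_rows_fold1) == n
```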
Part 1: Cross-Validation
Based on the description of cross-validation above, fill in the blanks to write a function that computes the AUC on each of the five train/test splits and then averages those five AUCs.
Part 2: Feature Engineering
Step 1: Estimate the Baseline Model
Let the “baseline” model be predictive but simple: it’s what a reasonable lender might use, avoiding demographic variables like age, sex, race, and ethnicity.
set.seed(1234)
# Notice below: piping into the curly brackets {} pipes the data into 3 branching functions: cross_validate(., "lpm"), cross_validate(., "logit"), and cross_validate(., "knn").
mortgage %>%
  # add variables you want to include in your model to the select statement:
  select(___) %>%
  mutate(fold = sample(1:5, size = nrow(.), replace = TRUE)) %>%
  {
    c(
      lpm = cross_validate(., "lpm"),
      logit = cross_validate(., "logit"),
      knn = cross_validate(., "knn")
    )
  }

Copy-paste the CV AUC for our three models for the baseline:
lpm logit knn
____________________________
Step 2: Interactions
Before estimating more complicated models, let’s consider the data carefully from a lender’s perspective. We want to create features (explanatory variables) that capture financially risky combinations, not just single risky variables. For example, a high loan amount may not be risky for a high-income borrower, and a high debt-to-income ratio may mean something different for a home purchase than for a cash-out refinance.
Interactions ask the question: what combinations of variables make for better prediction? In this case, we want to include any interactions that could add up to a financially stressed borrower. Explain why each of these interactions might be predictive:
- loan_income_ratio = loan_amount / income
- cash_out_high_dti = (loan_purpose == "Cash-out refinance") * debt_to_income_ratio
- home_purchase_loan_income = (loan_purpose == "Home purchase") * loan_income_ratio
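To see the mechanics of multiplying by a condition, here is a tiny sketch with made-up values. In R, a logical comparison is coerced to 0/1, so the product "switches on" the variable only where the condition holds:

```r
# Two hypothetical applications with the same DTI but different purposes:
purpose <- c("Home purchase", "Cash-out refinance")
dti <- c(40, 40)

# The interaction is 0 for the purchase and the full DTI for the cash-out:
(purpose == "Cash-out refinance") * dti
```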
Add the interactions to the model and check whether the CV AUC improves.
set.seed(1234)
mortgage %>%
  select(_____) %>%
  mutate(
    loan_income_ratio = ___,
    cash_out_high_dti = ___,
    home_purchase_loan_income = ___
  ) %>%
  mutate(fold = sample(1:5, size = nrow(.), replace = TRUE)) %>%
  {
    c(
      lpm = cross_validate(., "lpm"),
      logit = cross_validate(., "logit"),
      knn = cross_validate(., "knn")
    )
  }

Copy-paste the CV AUC for our three models when we include interactions:
lpm logit knn
____________________________
Step 3: Buckets
The next feature engineering task we’ll explore is to create buckets, which help capture threshold effects: situations where risk jumps when a variable crosses a certain level.
For example, a debt-to-income ratio moving from 22% to 23% might not matter, but moving from 49% to 50% might matter a lot.
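One way to eyeball a threshold is to build the indicator and compare approval rates on either side of it. A sketch with made-up data; the 50% cutoff and the toy values are purely illustrative:

```r
library(dplyr)

# Hypothetical applications (illustrative values only):
toy <- tibble(
  debt_to_income_ratio = c(10, 30, 45, 55, 70),
  approved = c(1, 1, 1, 0, 0)
)

# Bucket at a candidate threshold and compare approval rates:
toy %>%
  mutate(high_dti = debt_to_income_ratio > 50) %>%
  group_by(high_dti) %>%
  summarize(approval_rate = mean(approved))
```

A big gap in approval rates across the cutoff suggests the bucket may be predictive; trying several cutoffs and comparing CV AUCs is the more rigorous version of this check.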
What threshold is most predictive for debt_to_income_ratio? What about loan_to_income?
set.seed(1234)
mortgage %>%
  select(___) %>%
  mutate(
    high_debt_to_income = debt_to_income_ratio > ___,
    high_loan_to_income = (loan_amount / income) > ___
  ) %>%
  mutate(fold = sample(1:5, size = nrow(.), replace = TRUE)) %>%
  {
    c(
      lpm = cross_validate(., "lpm"),
      logit = cross_validate(., "logit"),
      knn = cross_validate(., "knn")
    )
  }

Copy-paste the CV AUC for our three models when we include buckets:
lpm logit knn
__________________________
Step 4: Logs and Squared Terms
The linear probability model assumes variables affect approval linearly; the logit assumes variables affect the log odds linearly. If we want to allow for diminishing marginal returns, or for variables that become risky at an increasing rate, we can use logs and squared terms:
- Income may have diminishing marginal returns: going from $50k to $100k matters more than going from $200k to $250k
- Very large loans may become risky at an increasing rate
- Risk may increase faster at extreme values
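A quick numeric check of these two shapes, using the income example from the first bullet:

```r
# On the log scale, the same $50k raise is a smaller change at higher incomes,
# which is what "diminishing marginal returns" means:
jump_low  <- log(100) - log(50)   # $50k -> $100k
jump_high <- log(250) - log(200)  # $200k -> $250k
jump_low > jump_high

# A squared term does the opposite: equal steps matter more at the extremes,
# so risk can increase at an increasing rate:
(50^2 - 40^2) < (90^2 - 80^2)
```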
Which variables are more likely to have diminishing marginal returns, and which variables are more likely to be associated with risk that’s increasing at an increasing rate? Discuss.
Add these transformations to the model and compare whether the CV AUC improves.
set.seed(1234)
mortgage %>%
  select(___) %>%
  mutate(
    log_income = log(income),
    log_loan_amount = log(loan_amount + 1),
    log_property_value = log(property_value + 1),
    dti_sq = debt_to_income_ratio^2,
    ltv_sq = loan_to_value_ratio^2
  ) %>%
  mutate(fold = sample(1:5, size = nrow(.), replace = TRUE)) %>%
  {
    c(
      lpm = cross_validate(., "lpm"),
      logit = cross_validate(., "logit"),
      knn = cross_validate(., "knn")
    )
  }

Step 5: Do we need more data?
We have 36553 observations in this mortgage data set (after cleaning and filtering), which represents families who applied for mortgages in 2024 in Riverside, CA. If we added data from 2023 as well, how much better would our predictions get?
Go to https://ffiec.cfpb.gov/data-browser/data/2023?category=counties&items=06065 (select 2023 and Riverside; leave steps 2 and 3 blank). Download the file and rename it county_06065_2023.csv. At the top of this document, where we define mortgage, replace the first line with:
mortgage <- read_csv("~/Downloads/county_06065.csv") %>%
  bind_rows(read_csv("~/Downloads/county_06065_2023.csv") %>% mutate(income = as.double(income)))
How does the CV AUC change for each of these models when we double the amount of data?
lpm logit knn
__________________________
In your own words, explain cross-validation. How is it better than evaluating a model on a single train/test split?
When would you want to be sure to include interactions in a model? Did interactions help us a lot or only a little here?
When would you want to be sure to include buckets in a model? Did buckets help us a lot or only a little here?
When would you want to be sure to include logs and squared terms in a model? Did they help us a lot or only a little here?
What are signs that you might need more data to train your model? Did doubling the amount of data help us a lot or only a little here?