11 Feature Engineering

Today we’ll focus on what a model sees by building and transforming features (explanatory variables in the model). We’ll also use cross-validation to test whether these choices improve prediction.

This class, we’ll learn about:

  1. Cross-validation: instead of relying on a single train/test split, we’ll use multiple splits of the data to get a more reliable measure of model performance. Here’s how it works:
    • Randomly assign each observation to a “fold” (numbers 1 through 5).
    • Save fold 1 to test the model; train the model on the other 4 folds.
    • Save fold 2 to test the model; train the model on the other 4 folds.
    • Save fold 3 to test the model; train the model on the other 4 folds.
    • Save fold 4 to test the model; train the model on the other 4 folds.
    • Save fold 5 to test the model; train the model on the other 4 folds.
    • Average the 5 resulting AUCs: this becomes the “Cross-Validation AUC” (CV AUC), which will be more trustworthy than the AUC on a single train/test split.
  2. Feature engineering: We’ll create new variables from the ones we already have, like logs, squares, or ratios, to help our models better capture patterns in the data.
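The fold bookkeeping described above can be sketched on made-up data (the tibble, its size, and the fold column here are illustrative, not the mortgage data): each fold serves as the test set exactly once, and for every fold the train and test sets partition the data.

```r
library(dplyr)

set.seed(1)
toy <- tibble(id = 1:100) %>%
  mutate(fold = sample(1:5, size = n(), replace = TRUE))

# For each fold f, the test set is that fold and the training set is the rest:
sizes <- sapply(1:5, function(f) {
  train <- toy %>% filter(fold != f)
  test  <- toy %>% filter(fold == f)
  nrow(train) + nrow(test)  # train and test always partition the data
})
sizes  # every entry equals nrow(toy) = 100
```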

Part 1: Cross-Validation

Based on the description of cross-validation above, fill in the blanks to complete a function that computes the AUC on each of five train/test splits and then averages those five AUCs.

cross_validate <- function(data, model) {
  
  results <- map(
    .x = 1:___,
    .f = function(fold_num) {
      
      train <- data %>% filter(fold != ___)
      test  <- data %>% filter(fold == ___)
      
      predictions <- if (model == "lpm") {
        predict_lpm(train, test)
      } else if (model == "logit") {
        predict_logit(train, test)
      } else if (model == "knn") {
        predict_knn(train, test, k = 60)
      }
      
      predictions %>%
        make_roc() %>%
        auc()
    }
  )
  
  ___(unlist(results))
}

# Test: this should give you a single number for the CV AUC.
mortgage %>%
  select(approved, income, loan_amount) %>%
  mutate(fold = sample(1:5, size = nrow(mortgage), replace = T)) %>%
  cross_validate("lpm")

Part 2: Feature Engineering

Step 1: Estimate the Baseline Model

Let the “baseline” model be predictive but simple: it’s what a reasonable lender might use, avoiding demographic variables like age, sex, race, and ethnicity.

set.seed(1234)

# Notice below: piping into the curly brackets {} pipes the data into 3 branching functions: cross_validate(., "lpm"), cross_validate(., "logit"), and cross_validate(., "knn").
mortgage %>%
  # add variables you want to include in your model to the select statement:
  select(___) %>%
  mutate(fold = sample(1:5, size = nrow(.), replace = T)) %>%
  {
    c(
      lpm = cross_validate(., "lpm"),
      logit = cross_validate(., "logit"),
      knn = cross_validate(., "knn")
    )
  }
Baseline Model

Copy-paste the CV AUC for our three models for the baseline:

   lpm     logit       knn 
   ____________________________

Step 2: Interactions

Before estimating more complicated models, let’s consider the data carefully from a lender’s perspective. We want to create features (explanatory variables) that capture financially risky combinations, not just single risky variables. For example, a high loan amount may not be risky for a high-income borrower, and a high debt-to-income ratio may mean something different for a home purchase than for a cash-out refinance.

Interactions

Interactions ask the question: what combinations of variables make for better prediction? In this case, we want to include any interactions that could add up to a financially stressed borrower. Explain why each of these interactions might be predictive:

  1. loan_income_ratio = loan_amount / income

  2. cash_out_high_dti = (loan_purpose == "Cash-out refinance") * debt_to_income_ratio

  3. home_purchase_loan_income = (loan_purpose == "Home purchase") * loan_income_ratio
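To see the mechanics, here is a sketch of how these columns get built, using a hypothetical one-row tibble (the column names follow the mortgage data; the values are invented, and income follows HMDA's convention of being reported in thousands). Note that multiplying a logical by a number in R coerces TRUE/FALSE to 1/0, which is what makes the interaction "switch on" only for the relevant loan purpose.

```r
library(dplyr)

# Invented one-row example; values are not from the real data.
example <- tibble(
  loan_amount = 400000,
  income = 100,  # reported in thousands
  loan_purpose = "Cash-out refinance",
  debt_to_income_ratio = 45
)

features <- example %>%
  mutate(
    loan_income_ratio = loan_amount / income,
    cash_out_high_dti = (loan_purpose == "Cash-out refinance") * debt_to_income_ratio,
    home_purchase_loan_income = (loan_purpose == "Home purchase") * loan_income_ratio
  )
features
```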

Add the interactions to the model and compare whether the CV AUC improved.

set.seed(1234)

mortgage %>%
  select(_____) %>%
  mutate(
    loan_income_ratio = ___,
    cash_out_high_dti = ___,
    home_purchase_loan_income = ___
  ) %>%
  mutate(fold = sample(1:5, size = nrow(.), replace = T)) %>%
  {
    c(
      lpm = cross_validate(., "lpm"),
      logit = cross_validate(., "logit"),
      knn = cross_validate(., "knn")
    )
  }
Interactions Model

Copy-paste the CV AUC for our three models when we include interactions:

   lpm     logit       knn 
   ____________________________

Step 3: Buckets

The next feature engineering task we’ll explore is to create buckets, which help capture threshold effects: situations where risk jumps when a variable crosses a certain level.

For example, a debt-to-income ratio moving from 22% to 23% might not matter, but moving from 49% to 50% might matter a lot.

What threshold is most predictive for debt_to_income_ratio? What about the loan-to-income ratio?
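The bucket mechanic itself is just a comparison that yields TRUE/FALSE. A minimal sketch on toy values (the 43% cutoff below is an illustrative assumption, not the answer to the question; you would try several thresholds and compare CV AUCs):

```r
library(dplyr)

# Toy debt-to-income values, in percent, echoing the examples above.
toy <- tibble(debt_to_income_ratio = c(22, 23, 49, 50))

buckets <- toy %>%
  mutate(high_debt_to_income = debt_to_income_ratio > 43)  # 43 is an illustrative cutoff
buckets
```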

set.seed(1234)

mortgage %>%
  select(___) %>%
  mutate(
    high_debt_to_income = debt_to_income_ratio > ___,
    high_loan_to_income = (loan_amount / income) > ___
  ) %>%
  mutate(fold = sample(1:5, size = nrow(.), replace = T)) %>%
  {
    c(
      lpm = cross_validate(., "lpm"),
      logit = cross_validate(., "logit"),
      knn = cross_validate(., "knn")
    )
  }
Buckets Model

Copy-paste the CV AUC for our three models when we include buckets:

   lpm     logit       knn 
   __________________________

Step 4: Logs and Squared Terms

The linear probability model assumes variables affect approval linearly; the logit assumes variables affect the log odds linearly. If we want to allow for diminishing marginal returns, or for variables that become risky at an increasing rate, we can use logs and squared terms:

  • Income may have diminishing marginal returns: going from $50k to $100k matters more than going from $200k to $250k
  • Very large loans may become risky at an increasing rate
  • Risk may increase faster at extreme values
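A quick arithmetic check of why these shapes differ: in logs, equal dollar gains produce shrinking gains (diminishing returns), while a squared term produces growing gains.

```r
# Equal $50k income gains, in logs (income in thousands):
log(100) - log(50)   # $50k -> $100k: log change = log(2), about 0.69
log(250) - log(200)  # $200k -> $250k: log change = log(1.25), about 0.22

# Equal 10-unit increases, squared: the jump gets bigger at higher values.
50^2 - 40^2  # 900
90^2 - 80^2  # 1700
```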
Logs and Squared Terms

Which variables are more likely to have diminishing marginal returns, and which variables are more likely to be associated with risk that’s increasing at an increasing rate? Discuss.

Add these transformations to the model and compare whether the CV AUC improves.

set.seed(1234)

mortgage %>%
  select(___) %>%
  mutate(
    log_income = log(income),
    log_loan_amount = log(loan_amount + 1),
    log_property_value = log(property_value + 1),
    dti_sq = debt_to_income_ratio^2,
    ltv_sq = loan_to_value_ratio^2
  ) %>%
  mutate(fold = sample(1:5, size = nrow(.), replace = T)) %>%
  {
    c(
      lpm = cross_validate(., "lpm"),
      logit = cross_validate(., "logit"),
      knn = cross_validate(., "knn")
    )
  }

Step 5: Do we need more data?

We have 36,553 observations in this mortgage data set (after cleaning and filtering), representing families who applied for mortgages in Riverside, CA in 2024. If we also added data from 2023, how much better would our predictions get?

Go to https://ffiec.cfpb.gov/data-browser/data/2023?category=counties&items=06065 (this pre-selects 2023 and Riverside County; leave steps 2 and 3 blank). Download the file and rename it county_06065_2023.csv. At the top of this document, where we define mortgage, replace the first line with:

mortgage <- read_csv("~/Downloads/county_06065.csv") %>%
  bind_rows(read_csv("~/Downloads/county_06065_2023.csv") %>% mutate(income = as.double(income)))
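The as.double() call matters because bind_rows() refuses to stack columns of incompatible types: if one year's CSV parses income as character and the other's as numeric, the bind errors until the types agree. A sketch with invented tibbles (not the real files):

```r
library(dplyr)

a <- tibble(income = c(120, 85))      # income parsed as numeric
b <- tibble(income = c("96", "110"))  # income parsed as character

# bind_rows(a, b)  # would error: can't combine <double> and <character>

combined <- bind_rows(a, b %>% mutate(income = as.double(income)))
combined
```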
More Data

How does the CV AUC change for each of these models when we double the amount of data?

   lpm     logit       knn 
   __________________________
Review Questions
  1. In your own words, explain cross-validation. How is it better than evaluating a model on a single train/test split?

  2. When would you want to be sure to include interactions in a model? Did interactions help us a lot or only a little here?

  3. When would you want to be sure to include buckets in a model? Did buckets help us a lot or only a little here?

  4. When would you want to be sure to include logs and squared terms in a model? Did they help us a lot or only a little here?

  5. What are signs that you might need more data to train your model? Did doubling the amount of data help us a lot or only a little here?
