# We'll find the 5 nearest neighbors for each data point.
k <- 5

# Let's start with a simple model of approved ~ income. Grab the first
# observation's income (pull() takes a tibble variable and returns a vector):
i <- mortgage %>%
  slice(1) %>%
  pull(income)

# Find that person's k nearest neighbors:
mortgage %>%
  select(income, approved) %>%
  ___(distance = abs(income - ___)) %>%
  slice_min(n = k, with_ties = TRUE, order_by = distance) %>%
  ___(prediction = ___)
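One possible way to fill in the blanks: the missing verbs are mutate() and summarize(), and the prediction is the neighbors' mean approval rate. (The same completions fill the blanks in the map() version below.) The tiny tibble here is a toy stand-in for the mortgage data so the chunk runs on its own:

```r
library(dplyr)

# Toy stand-in for the mortgage data (illustration only):
mortgage <- tibble(
  income   = c(30, 45, 50, 52, 55, 60, 90, 120),
  approved = c(0,  0,  1,  1,  1,  1,  1,  1)
)

k <- 5
i <- mortgage %>% slice(1) %>% pull(income)

mortgage %>%
  select(income, approved) %>%
  mutate(distance = abs(income - i)) %>%                    # distance to observation 1
  slice_min(n = k, with_ties = TRUE, order_by = distance) %>%  # keep the k closest rows
  summarize(prediction = mean(approved))                    # neighbors' approval rate
# -> prediction = 0.6 (three of the five closest neighbors were approved)
```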
# The map() version to find the KNN prediction for each observation, and then visualize:
mortgage %>%
  slice_sample(n = 1000) %>% # limit to 1000 because KNN is expensive
  mutate(
    prediction = map(
      .x = income,
      .f = ~ mortgage %>%
        ___(distance = abs(income - ___)) %>%
        slice_min(n = k, with_ties = TRUE, order_by = distance) %>%
        ___(prediction = ___) %>%
        pull(prediction)
    ) %>% as_vector()
  ) %>%
  ggplot(aes(x = income, y = approved)) +
  geom_jitter(alpha = 0.2, height = .025, width = 0, size = .5) +
  geom_point(aes(y = prediction), color = "red", alpha = 0.4) +
  labs(
    title = "KNN predictions using income",
    x = "Income",
    y = "Approved"
  )

Quiz 3 Practice Test
Here are some practice questions to help you prepare for Quiz 3 (first attempt: Friday, May 8; retake: Friday, May 15). The first attempt will focus on these concepts using the job applications data set, while the retake will cover the same ideas using the criminal recidivism data set.
1) K Nearest Neighbors (5 points)
- Explain what the K Nearest Neighbors model is and how it works.
Answer: KNN works by asking: for each observation, what are its \(k\) closest neighbors, and how are they labeled? For example, to find the KNN prediction for the application of a 30 year old white couple making 130K/year, KNN (with k = 5) finds the approval rate of the five applications closest to that couple’s application (5 other 30-ish white couples making around 130K/year).
- Fill in the missing pieces to implement K Nearest Neighbors from scratch:
- When should you expect that KNN outperforms other models?
Answer: when the probability of approval follows a complicated nonlinear pattern, KNN has an advantage over the LPM and the logit. For example, KNN may perform better when approval probabilities follow patterns that look like sine or cosine waves rather than straight lines.
- The lower \(k\) is in KNN, the (choose one: lower/higher) the variance of the model.
Answer: higher.
- Suppose we use the same data for training the model as we use to test it. When we let k = 1, what should we expect the AUC will be? Explain.
Answer: we’ll get AUC = 1: it looks like a perfect fit with FPR = 0 and TPR = 1. But in this case, our model has simply memorized the data set: each observation’s single nearest neighbor is itself, and the prediction is whether or not they were themselves approved. This is overfitting to the extreme: with new data, the model would perform much worse. This is the reason we don’t use the same data set to train and to test the model.
2) Train/test split (1 point)
Write code to split the data set into training and test sets in R.
Answer:
mortgage <- mortgage %>%
  mutate(train = sample(c(0, 1), size = nrow(.), prob = c(.2, .8), replace = TRUE))

mortgage_train <- mortgage %>% filter(train == 1)
mortgage_test <- mortgage %>% filter(train == 0)

3) Search over values of k (2 points)
- Write code to let `k` be each of `c(5, 10, 20, 40, 80, 200)` to look for the `k` that maximizes test AUC.
k_values <- c(5, 10, 20, 40, 80, 200)

map(
  .x = ____,
  .f = ~ predict_knn(mortgage_train, mortgage_test, k = ____) %>%
    make_roc() %>%
    auc()
)

- How does map() help us choose the best value of k in KNN?
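map() lets us run the same predict-then-score pipeline once per candidate `k` and collect all the test AUCs in one place, so choosing the best `k` is just a lookup. One possible completion of the blanks, assuming the predict_knn(), make_roc(), and auc() helper functions from class (map_dbl() is used instead of map() so the result is a plain numeric vector):

```r
library(purrr)
library(dplyr)

k_values <- c(5, 10, 20, 40, 80, 200)

# One test AUC per candidate k:
aucs <- map_dbl(
  .x = k_values,
  .f = ~ predict_knn(mortgage_train, mortgage_test, k = .x) %>%
    make_roc() %>%
    auc()
)

# The k with the highest test AUC:
best_k <- k_values[which.max(aucs)]
```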
4) Bias/Variance (2 points)
Consider the following two scenarios:
- A model is trained on one data set, then retrained on a slightly different data set. Its predictions change a lot, but on average it captures the true pattern well. Is this a high or low variance model? Is it a high or low bias model? Which of the 3 models from class does this remind you of?
- A model is trained on one data set, then retrained on a slightly different data set. Its predictions stay very similar, but they systematically miss the true pattern. Is this a high or low variance model? Is it a high or low bias model? Which of the 3 models from class does this remind you of?
5) Cross Validation (2 points)
Explain what cross-validation is and how to do it. How is it better than evaluating a model on a single train/test split?
Fill in the blanks to write the function `cross_validate`:
cross_validate <- function(data, model) {
  results <- map(
    .x = 1:___,
    .f = function(___) {
      train <- data %>% filter(____)
      test <- data %>% filter(____)
      predictions <- if (model == "lpm") {
        predict_lpm(train, test)
      } else if (model == "logit") {
        predict_logit(train, test)
      } else if (model == "knn") {
        predict_knn(train, test, k = 60)
      }
      predictions %>%
        make_roc() %>%
        auc()
    }
  )
  ___(unlist(___))
}
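One possible way to fill in the blanks, assuming five folds indexed by a `fold` column that is created before the function is called (as in the test call below), and the predict_lpm(), predict_logit(), predict_knn(), make_roc(), and auc() helpers from class:

```r
library(purrr)
library(dplyr)

cross_validate <- function(data, model) {
  results <- map(
    .x = 1:5,                               # one iteration per fold
    .f = function(i) {
      train <- data %>% filter(fold != i)   # train on the other four folds
      test <- data %>% filter(fold == i)    # test on the held-out fold
      predictions <- if (model == "lpm") {
        predict_lpm(train, test)
      } else if (model == "logit") {
        predict_logit(train, test)
      } else if (model == "knn") {
        predict_knn(train, test, k = 60)
      }
      predictions %>%
        make_roc() %>%
        auc()
    }
  )
  mean(unlist(results))                     # average AUC across the five folds
}
```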
# Test the function: this should give you a single number for the CV AUC.
mortgage %>%
  select(approved, income, loan_amount) %>%
  mutate(fold = sample(1:5, size = nrow(.), replace = TRUE)) %>%
  cross_validate("lpm")

6) Feature Engineering (4 points)
- Fill in the blanks: Consider the variable `income`. The best choice would be a _____ transformation because income has _____: earning an extra 10K/year would boost your chances of getting approved more when you earn 60K/year than when you earn 300K/year.
Answer: log; diminishing marginal returns
- Consider the variable `debt_to_income_ratio`. The best choice would be a _____ transformation because debt_to_income_ratio has _____: going from a debt-to-income ratio of 10% to 20% may not change approval chances very much, but going from 50% to 60% could sharply reduce the chances of getting approved.
Answer: square; increasing marginal effects
- Why would we want to include this variable in our model?
  `(loan_purpose == "Cash-out refinance") * debt_to_income_ratio`
Answer: High debt-to-income ratios could be especially risky for borrowers taking cash out of their home equity, so this interaction allows the model to give extra penalty to high debt levels for this specific group.
- Why would we want to include this variable in our model?
  `home_purchase_loan_income = (loan_purpose == "Home purchase") * loan_income_ratio`
Answer: Borrowers buying a home may be viewed differently by lenders, so the model can allow loan size relative to income to matter more (or less) for this group than for other loan purposes.
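All four engineered features above could be built in a single mutate() call. The toy tibble here stands in for the mortgage data so the chunk runs on its own; note that in R a logical comparison times a number yields a number (TRUE counts as 1, FALSE as 0), which is what makes the interaction terms work:

```r
library(dplyr)

# Toy rows for illustration; the real data set has many more columns and rows.
mortgage <- tibble(
  income               = c(60, 300),
  debt_to_income_ratio = c(0.2, 0.55),
  loan_income_ratio    = c(3.0, 2.5),
  loan_purpose         = c("Home purchase", "Cash-out refinance")
)

mortgage <- mortgage %>%
  mutate(
    log_income   = log(income),                     # diminishing marginal returns
    dti_squared  = debt_to_income_ratio^2,          # increasing marginal effects
    cash_out_dti = (loan_purpose == "Cash-out refinance") * debt_to_income_ratio,
    home_purchase_loan_income = (loan_purpose == "Home purchase") * loan_income_ratio
  )
```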