set.seed(123)
mortgage <- mortgage %>%
  mutate(
    train = sample(
      c(0, 1),
      size = ___,
      prob = ___,
      replace = ___
    )
  )
mortgage_train <- mortgage %>% filter(train == 1)
mortgage_test <- mortgage %>% filter(train == 0)
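A side note on how sample() behaves here. Below is a minimal base-R sketch on a hypothetical vector of 10,000 flags (not the mortgage data): because each 0/1 flag is drawn independently, the split is only *about* 80/20, which is why a sensible check only requires the train/test ratio to be near 4:1 rather than exactly 4:1.

```r
# Hypothetical standalone example (not the mortgage data): sample() with
# prob = c(.2, .8) draws each 0/1 flag independently, so the realized
# split varies around 80/20 rather than hitting it exactly.
set.seed(123)
flags <- sample(c(0, 1), size = 10000, prob = c(0.2, 0.8), replace = TRUE)

mean(flags)                        # close to 0.8, not exactly 0.8
sum(flags == 1) / sum(flags == 0)  # close to 4:1
```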
test_that("train/test split is created correctly", {
  # 1) both are tibbles
  expect_s3_class(mortgage_train, "tbl_df")
  expect_s3_class(mortgage_test, "tbl_df")

  # 2) about 80/20 split → ~4:1 ratio
  ratio <- nrow(mortgage_train) / nrow(mortgage_test)
  expect_true(ratio > 3 & ratio < 5)
})

10 Bias/Variance Trade-off
1) Warm-up: What did we discover last time?
Read this and fill in the blanks:
- KNN makes a prediction by finding the k most similar observations and looking at their outcomes. For example, if k = 5 and 4 of the 5 nearest neighbors were approved, KNN predicts an approval probability of ______.
- When k is large, the KNN fit is smoother because each prediction averages over many neighbors.
- When k is small, the KNN fit is bumpier because each prediction depends on only a few neighbors.
- Last time, as k got smaller, the AUC increased.
- When k = 1, the AUC was equal to ______.
- But this happened because each observation was allowed to be its own nearest neighbor. In other words, the model memorized the data.
- This is why we should not evaluate a model on the same data we used to train it.
- Today, we will split the data into training data and test data. The model learns from the training data, but we evaluate it on the test data. Does KNN still have an advantage over the LPM and the logit?
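The memorization point can be made concrete. Here is a minimal base-R sketch on made-up data (not the mortgage set, and not the class's predict_knn helper): when k = 1 is evaluated on its own training data, every point's nearest neighbor is itself, so training accuracy is perfect no matter how noisy the outcome is.

```r
# Made-up toy data (not the mortgage set)
set.seed(1)
x <- runif(50)
y <- rbinom(50, 1, 0.5)  # pure noise: the labels have no real pattern

# k = 1 "predictions" on the training data itself: each xi's nearest
# neighbor (distance 0) is xi, so the prediction is its own label
knn1_train_pred <- sapply(x, function(xi) y[which.min(abs(x - xi))])

train_accuracy <- mean(knn1_train_pred == y)
train_accuracy  # 1: perfect on the training data, i.e., memorization
```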
2) Split the data randomly into 80% train/20% test
Fill in the blanks to create mortgage_train for training the model and mortgage_test for evaluating the model:
3) Evaluate the LPM, Logit, and KNN using the training vs test splits
Run the models and compute AUC.
- Which model performs best?
- Are the differences large or small?
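If it helps to see what an AUC number means before comparing them, here is a small base-R sketch with hypothetical scores and labels (not output of the class's make_roc() or auc() helpers): the AUC equals the probability that a randomly chosen approved case gets a higher predicted score than a randomly chosen denied case.

```r
# Hypothetical toy scores and labels, not output of the class helpers
scores <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
labels <- c(1, 1, 0, 1, 0, 0)

auc_manual <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  # share of positive/negative pairs ranked correctly; ties count half
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

auc_manual(scores, labels)  # 8/9, about 0.889
```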
# LPM AUC: ____
predict_lpm(____, ____) %>%
  make_roc() %>%
  plot_and_continue(title = "LPM") %>%
  auc()
# Logit AUC: ____
predict_logit(____, ____) %>%
  make_roc() %>%
  plot_and_continue(title = "Logit") %>%
  auc()
# KNN AUC with k = 60:
predict_knn(____, ____, k = 60) %>%
  make_roc() %>%
  plot_and_continue() %>%
  auc()

One other important adjustment I've made for KNN: KNN is based on distance, but our variables are measured in different units.
For example, income is measured in thousands of dollars, loan amount is measured in thousands of dollars, and race/sex/loan type become dummy variables equal to 0 or 1. If we do not adjust the scale, variables with larger numbers can dominate the distance calculation.
The solution is to standardize each numeric variable: \[\frac{x - \text{mean}(x)}{\text{sd}(x)}\]
In R, scale() does this for us. After scaling, each variable is measured in standard deviation units, so the distance calculation treats the variables more fairly. Find where I’ve used scale() in the function predict_knn.
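To check that scale() and the formula above agree, here is a quick base-R comparison on a hypothetical vector with wildly different magnitudes:

```r
# Hypothetical values on very different scales
x <- c(10, 200, 3000, 45000)

standardized <- (x - mean(x)) / sd(x)
via_scale <- as.numeric(scale(x))  # scale() returns a matrix; drop attributes

all.equal(standardized, via_scale)  # TRUE: same numbers
round(mean(standardized), 10)       # 0: standardized variables have mean 0
sd(standardized)                    # 1: ...and standard deviation 1
```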
4) Search over values of k
Your goal: find the value of k that maximizes test AUC.
Plot k on the x-axis and AUC on the y-axis. Consider: does KNN ever beat the LPM or Logit in terms of AUC?
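Before filling in the blanks, the overall pattern is worth seeing in isolation: compute a score for each candidate k, then keep the k with the highest score. A minimal base-R sketch, where the score() function is a made-up stand-in for test AUC (it peaks at k = 20 by construction) and sapply() plays the role purrr's map() plays in the class code:

```r
# Base-R sketch of the search pattern: score each candidate k, keep the best.
# score() is a hypothetical stand-in that peaks at k = 20, NOT a real AUC.
k_values <- c(5, 10, 20, 40, 60)

score <- function(k) -(k - 20)^2

scores <- sapply(k_values, score)      # same role as purrr::map_dbl()
best_k <- k_values[which.max(scores)]
best_k  # 20
```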
k_values <- c(5, 10, ___, ___, ___, ___, ___, ___)

knn_results <- tibble(
  k = k_values,
  AUC = map(
    .x = k_values,
    .f = ~ predict_knn(mortgage_train, mortgage_test, k = .x) %>%
      make_roc() %>%
      auc()
  ) %>%
    as_vector()
)
# Visualize k versus AUC:
knn_results %>%
  ggplot(___) +
  ___
# Find the k that maximizes the AUC:
knn_results %>%
  ___

5) A case where KNN shines
KNN is very flexible, but flexibility only helps when there's a complex pattern to learn. Take, for instance, a wave-like relationship between income and approval, like sin(x). The LPM and Logit can only capture simple, smooth relationships, so they miss the wave pattern. KNN can do much better because it uses nearby observations instead of forcing one global curve.
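To build intuition before running the class helpers, here is a self-contained base-R miniature of the same idea (hypothetical numbers spanning several sine cycles, not the exact class simulation): a straight-line fit captures almost none of the wave, while a local average of nearby points tracks it.

```r
# Hypothetical miniature, not the class simulation
set.seed(3)
x <- runif(2000, 0, 40)      # several full sine cycles
p <- 0.5 + 0.35 * sin(x)     # wave-shaped true approval probability
y <- rbinom(2000, 1, p)

# A straight line averages over the ups and downs, so it explains ~nothing
linear_r2 <- summary(lm(y ~ x))$r.squared

# A KNN-style local average near x0 = pi/2 (where the true p is about 0.85)
x0 <- pi / 2
nearest <- order(abs(x - x0))[1:50]
local_avg <- mean(y[nearest])

linear_r2   # close to 0
local_avg   # close to 0.85
```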
set.seed(123)

income <- tibble(
  income = runif(1000, 0, 800),
  approval_prob = 0.5 + 0.35 * sin(income / 60),
  approved = rbinom(1000, size = 1, prob = approval_prob),
  train = sample(0:1, size = 1000, prob = c(.2, .8), replace = TRUE)
)
income %>%
  ggplot(aes(x = income, y = approved)) +
  geom_jitter(height = .05, alpha = .4) +
  geom_line(aes(y = approval_prob), color = "red", linewidth = 1)
income <- income %>% select(-approval_prob)
income_train <- income %>% filter(train == 1)
income_test <- income %>% filter(train == 0)

Let's see how the Logit and LPM do at this sin(x) task:
# Logit AUC: ____
predict_logit(____, ____) %>%
  make_roc() %>%
  plot_and_continue() %>%
  auc()
# LPM AUC: ____
predict_lpm(____, ____) %>%
  make_roc() %>%
  plot_and_continue() %>%
  auc()

Now let's see how KNN does at this sin(x) task:
# KNN AUC: ____
predict_knn(___, ___, k = 30) %>%
  make_roc() %>%
  plot_and_continue() %>%
  auc()

6) Bias/Variance Tradeoff with KNN
One of the most important ideas in machine learning is the bias/variance tradeoff.
Bias means the model is too simple. A high-bias model misses real patterns in the data. It smooths over important structure and tends to make similar predictions for everyone.
Variance means the model is too sensitive. A high-variance model reacts too strongly to small details in the training data. It can chase noise instead of learning the true pattern.
In KNN, the value of k controls this tradeoff.
- When k is very small, each prediction depends on only a few neighbors. The model becomes very flexible, but also very noisy. This is low bias and high variance.
- When k is very large, each prediction averages over many neighbors. The model becomes very smooth, but it may miss real patterns. This is high bias and low variance.
The goal is not to make the model as flexible as possible. The goal is to make it flexible enough to learn real patterns, but not so flexible that it memorizes noise.
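The two extremes can be checked directly. Here is a minimal base-R sketch of a one-variable neighbor average on made-up data (not the class's predict_knn): with k equal to the full sample size, every prediction collapses to the global mean (maximal smoothing, high bias), while with k = 1 a training point simply predicts its own label (memorization, high variance).

```r
# Made-up toy data, not the class datasets
set.seed(2)
x <- sort(runif(100))
y <- rbinom(100, 1, 0.5 + 0.35 * sin(x * 6))

# Average the outcomes of the k points closest to x0
knn_predict <- function(x0, k) {
  nearest <- order(abs(x - x0))[1:k]
  mean(y[nearest])
}

# k = n: every prediction is the global mean -- very smooth, high bias
knn_predict(x[10], k = 100) == mean(y)  # TRUE

# k = 1: a training point predicts its own label -- memorization
knn_predict(x[10], k = 1) == y[10]      # TRUE
```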
Compared to KNN, do you think LPM and Logit are usually higher-bias or higher-variance models?
# Visualizing Model Variance:
plot_knn_income <- function(k_value) {
  mortgage %>%
    slice_sample(n = 1000) %>%
    mutate(
      prediction = map_dbl(
        income,
        function(x) {
          mortgage %>%
            mutate(distance = abs(income - x)) %>%
            slice_min(distance, n = k_value) %>%
            summarize(prediction = mean(approved)) %>%
            pull(prediction)
        }
      )
    ) %>%
    arrange(income) %>%
    ggplot(aes(x = income, y = approved)) +
    geom_jitter(height = .025, width = 0, alpha = .2, size = .5) +
    geom_line(aes(y = prediction), linewidth = 1, color = "red") +
    labs(
      title = paste("KNN predictions using income, k =", k_value),
      x = "Income",
      y = "Approved"
    )
}
# Run each of these several times.
# If the curve changes a lot, that's a high variance model.
plot_knn_income(5)
plot_knn_income(100)
plot_knn_income(500)

Why do we use training and test sets?
Write code to split a dataset into training and test sets in R.
How does map() help us choose the best value of k in KNN?
When should you expect KNN to outperform LPM and Logit?
Consider: A model is trained on one dataset, then retrained on a slightly different dataset. Its predictions change a lot, but on average it captures the true pattern well. Is this a high or low variance model? Is it a high or low bias model? Which of the 3 models does this remind you of from class?
A model is trained on one dataset, then retrained on a slightly different dataset. Its predictions stay very similar, but they systematically miss the true pattern. Is this a high or low variance model? Is it a high or low bias model? Which of the 3 models does this remind you of from class?