---
title: "9 K-Nearest Neighbors"
format:
  html:
    self-contained: true
---

In this assignment, you will:

- Refactor your code from last class to calculate ROC curves and AUC for the LPM and the Logit model.
- Meet Model #3: K-Nearest Neighbors (KNN), a completely different (nonparametric!) way of thinking about prediction.
- Build KNN from scratch, so you can see exactly what it’s doing under the hood.
- Learn how to use `FNN::knn` to scale that idea up to lots and lots of features (variables).
- Stress test KNN by changing the value of `k` and tracking how performance changes.
- Discover something a little unsettling: a model can look "perfect"... without actually learning anything. We'll discuss this in depth on Thursday.

## Setup

```{r installs, eval = F}
# Run these lines only one time to install these new packages.
install.packages("fastDummies")
install.packages("FNN")
install.packages("testthat")
```

```{r setup, echo = F, message = F, warning = F}
library(tidyverse)
library(fastDummies)
library(FNN)
library(testthat)

# You may need to change the file path (this one is for Macs)
mortgage <- read_csv("~/Downloads/county_06065.csv") %>%
  filter(
    business_or_commercial_purpose == 2,
    total_units == 1,
    derived_dwelling_category == "Single Family (1-4 Units):Site-Built",
    action_taken %in% c(1, 2, 3),
    income >= 0,
    income <= 800
  ) %>%
  mutate(
    approved = if_else(action_taken %in% c(1, 2), 1, 0),

    loan_purpose = case_when(
      loan_purpose == 1  ~ "Home purchase",
      loan_purpose == 2  ~ "Home improvement",
      loan_purpose == 31 ~ "Refinance",
      loan_purpose == 32 ~ "Cash-out refinance",
      loan_purpose == 4  ~ "Other",
      .default = NA_character_
    ),

    derived_ethnicity = case_when(
      derived_ethnicity == "Hispanic or Latino"     ~ "Hispanic",
      derived_ethnicity == "Not Hispanic or Latino" ~ "Not Hispanic",
      derived_ethnicity == "Joint"                  ~ "Joint",
      .default = NA_character_
    ),

    derived_sex = na_if(derived_sex, "Sex Not Available"),

    derived_race = case_when(
      derived_race == "White"                      ~ "White",
      derived_race == "Black or African American" ~ "Black",
      derived_race == "Asian"                     ~ "Asian",
      derived_race == "Joint"                     ~ "Joint",
      derived_race %in% c(
        "American Indian or Alaska Native",
        "Native Hawaiian or Other Pacific Islander",
        "2 or more minority races"
      ) ~ "Other",
      .default = NA_character_
    ),

    applicant_age = na_if(applicant_age, "8888"),

    debt_to_income_ratio = case_when(
      debt_to_income_ratio == "<20%" ~ "15",
      debt_to_income_ratio == "20%-<30%" ~ "25",
      debt_to_income_ratio == "30%-<36%" ~ "33",
      debt_to_income_ratio == "50%-60%" ~ "55",
      debt_to_income_ratio == ">60%" ~ "65",
      debt_to_income_ratio %in% as.character(36:49) ~ debt_to_income_ratio,
      .default = NA_character_
    ),

    loan_type = case_when(
      loan_type == 1 ~ "Conventional",
      loan_type == 2 ~ "FHA",
      loan_type == 3 ~ "VA",
      loan_type == 4 ~ "USDA/FSA",
      .default = NA_character_
    ),
    
    underwriter = case_when( 
      `aus-1` == 1 ~ "Desktop Underwriter",
      `aus-1` == 2 ~ "Loan Prospector",
      `aus-1` == 3 ~ "Technology Open to Approved Lenders (TOTAL)",
      `aus-1` == 4 ~ "Guaranteed Underwriting System (GUS)",
      `aus-1` == 5 ~ "Other",
      `aus-1` == 6 ~ "Not applicable",
      `aus-1` == 7 ~ "Internal Proprietary System",
      .default = NA_character_
    ),

    lien_status = case_when(
      lien_status == 1 ~ "First lien",
      lien_status == 2 ~ "Subordinate lien",
      .default = NA_character_
    ),

    occupancy_type = case_when(
      occupancy_type == 1 ~ "Principal residence",
      occupancy_type == 2 ~ "Second residence",
      occupancy_type == 3 ~ "Investment property",
      .default = NA_character_
    ),

    loan_amount = loan_amount / 1000,
    property_value = as.numeric(property_value) / 1000,
    debt_to_income_ratio = as.numeric(debt_to_income_ratio),
    loan_term = as.numeric(loan_term),
    interest_rate = as.numeric(interest_rate),

    loan_purpose = factor(loan_purpose),
    derived_ethnicity = factor(derived_ethnicity),
    derived_race = factor(derived_race),
    derived_sex = factor(derived_sex),
    applicant_age = factor(
      applicant_age,
      levels = c("<25", "25-34", "35-44", "45-54", "55-64", "65-74", ">74")
    ),
    conforming_loan_limit = factor(conforming_loan_limit),
    loan_type = factor(loan_type),
    lien_status = factor(lien_status),
    occupancy_type = factor(occupancy_type),
    underwriter = factor(underwriter),
    loan_to_value_ratio = as.double(loan_to_value_ratio)
  ) %>%
  select(
    approved,
    income,
    debt_to_income_ratio,
    race = derived_race,
    ethnicity = derived_ethnicity,
    sex = derived_sex,
    age = applicant_age,
    loan_amount,
    loan_type,
    loan_purpose,
    loan_to_value_ratio,
    loan_term,
    underwriter,
    lien_status,
    conforming_loan_limit,
    property_value,
    occupancy_type
  ) %>%
  mutate(
    race = fct_relevel(race, "White"),
    ethnicity = fct_relevel(ethnicity, "Not Hispanic"),
    sex = fct_relevel(sex, "Joint"),
    age = fct_relevel(age, "35-44"),
    loan_type = fct_relevel(loan_type, "Conventional"),
    loan_purpose = fct_relevel(loan_purpose, "Home purchase"),
    
    lien_status = fct_relevel(lien_status, "First lien"),
    conforming_loan_limit = fct_relevel(conforming_loan_limit, "C"),
    occupancy_type = fct_relevel(occupancy_type, "Principal residence")
  ) %>%
  drop_na() %>%
  distinct()
```

```{r helpers}
# ---- predict_lpm() -----
# Let `predict_lpm` take a data set and return that same
# data set, only with 2 columns: `approved` and `prediction`,
# where prediction is the LPM prediction for each observation.

predict_lpm <- function(data) {
  ____
}

test_that("predict_lpm works and returns correct first row", {
  # Check it doesn't come up with an error:
  expect_no_error(mortgage %>% predict_lpm())
  result <- mortgage %>% predict_lpm()
  # Check that result exists and is a tibble:
  expect_s3_class(result, "data.frame")
  # Check that the names are correct:
  expect_named(result, c("approved", "prediction"))
  # Check the values in the first row:
  expect_equal(result$approved[1], 1)
  expect_equal(result$prediction[[1]], 0.885, tolerance = 1e-3)
})

# ---- predict_logit() -----
# Let `predict_logit` take a data set and return that same
# data set, only with 2 columns: `approved` and `prediction`,
# where prediction is the logit prediction for each observation.

predict_logit <- function(data) {
  ____
}

test_that("predict_logit works and returns correct first row", {
  # Check it doesn't come up with an error:
  expect_no_error(mortgage %>% predict_logit())
  result <- mortgage %>% predict_logit()
  # Check that result exists and is a tibble:
  expect_s3_class(result, "data.frame")
  # Check that the names are correct:
  expect_named(result, c("approved", "prediction"))
  # Check the values in the first row:
  expect_equal(result$approved[1], 1)
  expect_equal(result$prediction[[1]], 0.910, tolerance = 1e-3)
})

# Run this to use confusion():
confusion <- function(data, threshold) {
  data %>%
    mutate(
      classifier = if_else(prediction >= threshold, 1, 0)
    ) %>%
    drop_na(classifier) %>%
    summarise(
      TP = sum(approved == 1 & classifier == 1),
      TN = sum(approved == 0 & classifier == 0),
      FP = sum(approved == 0 & classifier == 1),
      FN = sum(approved == 1 & classifier == 0)
    ) %>%
    mutate(
      TPR = TP / (TP + FN),
      FPR = FP / (FP + TN),
      accuracy = (TP + TN) / (TP + TN + FN + FP),
      threshold = threshold
    ) %>%
    select(threshold, TPR, FPR, accuracy)
}


# ---- make_roc() -----
# Let make_roc be a function that takes a data set
# of predictions, and then iterates over each 
# threshold between 0 and 1, calling confusion(data, threshold = .x)
# for each threshold level. make_roc should return a tibble
# of results from confusion(data, threshold = .x).

make_roc <- function(data) {
  ___
}

test_that("make_roc works and returns correct first row", {
  # Check it doesn't come up with an error:
  expect_no_error(mortgage %>% predict_lpm() %>% make_roc())
  result <- mortgage %>% predict_lpm() %>% make_roc()
  # Check that result exists and is a tibble:
  expect_s3_class(result, "data.frame")
  # Check that the names are correct:
  expect_named(result, c("threshold", "TPR", "FPR", "accuracy"))
  # Check that there are 101 rows:
  expect_equal(nrow(result), 101)
})

# ---- auc() -----
# Let `auc()` be a function that takes the output of make_roc()
# ("roc_data") with variables threshold, TPR, FPR, and accuracy,
# and use the trapezoid method to find the area under the curve,
# finally outputting a value for the AUC.

auc <- function(roc_data) {
  roc_data %>%
    arrange(FPR, TPR) %>% 
    mutate(____) %>%
    summarize(___) %>%
    pull(AUC)
}

test_that("auc works and returns the correct value", {
  # Check it doesn't come up with an error:
  expect_no_error(mortgage %>% predict_lpm() %>% make_roc() %>% auc())
  result <- mortgage %>% predict_lpm() %>% make_roc() %>% auc()
  # Check that result is the correct value:
  expect_equal(result, .788, tolerance = .001)
})

# Run this to use `plot_and_continue()`, which takes data,
# plots it as a side effect, and returns that data so you
# can continue on with your query.
plot_and_continue <- function(data, x, y, title) {
  print(
    data %>%
      ggplot(aes(x = {{x}}, y = {{y}})) +
      geom_line() +
      geom_point() +
      labs(title = title) +
      xlim(0, 1.1) +
      ylim(0, 1.1)
  )
  data
}
```
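
The `auc()` helper above uses the trapezoid method. To see that method on its own, here is a tiny numeric sketch on a made-up three-point curve (the points and the `toy` name are invented for illustration, not part of the assignment data):

```{r trapezoid-demo}
# The trapezoid rule approximates the area under a curve by summing
# (x2 - x1) * (y1 + y2) / 2 over each adjacent pair of points.
toy <- tibble(x = c(0, 0.5, 1), y = c(0, 0.8, 1))

toy %>%
  arrange(x) %>%
  summarize(area = sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2))
# 0.5 * (0 + 0.8) / 2 + 0.5 * (0.8 + 1) / 2 = 0.2 + 0.45 = 0.65
```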

## 1) Finish writing the helper functions above

After you do that, the code chunk below should work to visualize the ROC curve for the LPM and the logit, along with calculating the AUC.

```{r lpm-logit}
# LPM:
mortgage %>%
  predict_lpm() %>%
  make_roc() %>%
  plot_and_continue(x = FPR, y = TPR, title = "LPM ROC Curve") %>%
  auc()

# Logit:
mortgage %>%
  predict_logit() %>%
  make_roc() %>%
  plot_and_continue(x = FPR, y = TPR, title = "Logit ROC Curve") %>%
  auc()
```

## 2) Model 3: K-Nearest Neighbors

So far, we've learned about the linear probability model (LPM) and logistic regression (Logit). Now I'll introduce a third model: **K-Nearest Neighbors (KNN)**.

**KNN works by asking: for each observation, what are its $k$ closest neighbors, and how are they labeled?** For example, to find the KNN prediction for the application of a 30-year-old white couple making 130K/year, KNN (with k = 5) finds the approval rate of the five applications closest to that couple's application (5 other 30-ish white couples making around 130K/year).

KNN is a *nonparametric* model: there are no parameters ($\beta$) to estimate. This is in contrast to the LPM and the Logit, which are both *parametric* models (for both of those, we had $\beta_0$, $\beta_1$, $\beta_2$, ... to estimate).

Being nonparametric is both a strength and a weakness:

- Strength: Without parameters, you don't make any assumptions about the functional form of the data generating process (like "linear" or "linear in log odds"), so you can’t get the functional form wrong.

- Weakness: You don't get parameter estimates, so you can't do inference about the role of any one variable: with KNN you can never say "a one-unit increase in X leads to a $\beta$-unit increase in Y on average". KNN lets you do prediction in a flexible way, but not inference.
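
To make the neighbor-averaging idea concrete before you implement it, here is a tiny hand-worked sketch on made-up data (the numbers are invented for illustration): with k = 2, the prediction for a new point is just the average label of its 2 closest points.

```{r knn-toy}
# Made-up data: 5 applicants with an income (in thousands) and a label.
toy <- tibble(
  income = c(30, 40, 50, 120, 130),
  label  = c(0,  0,  1,  1,   1)
)

# Predict for a new applicant with income 45 using k = 2:
# the two closest incomes are 40 and 50, with labels 0 and 1,
# so the prediction is mean(c(0, 1)) = 0.5.
toy %>%
  mutate(distance = abs(income - 45)) %>%
  slice_min(order_by = distance, n = 2, with_ties = FALSE) %>%
  summarize(prediction = mean(label))
```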

Fill in the missing pieces to implement K Nearest Neighbors from scratch:

```{r knn}
# We'll find the 5 nearest neighbors for each data point.
k <- 5

# Let's start with a simple model of approved ~ income. Grab the first observation's income (pull() takes a tibble variable and returns a vector):
i <- mortgage %>% 
  slice(1) %>% 
  pull(income)

# Find that person's k nearest neighbors:
mortgage %>% 
  select(income, approved) %>%
  ___(distance = abs(income - i)) %>%
  slice_min(n = k, with_ties = T, order_by = distance) %>%
  ___(prediction = mean(___))

# The map() version to find the KNN prediction for each observation:
mortgage %>%
  slice_sample(n = 1000) %>% # limit to 1000 because KNN is expensive
  mutate(
    prediction = map(
      .x = income,
      .f = ~ mortgage %>% 
        ___(distance = abs(income - .x)) %>%
        slice_min(n = k, with_ties = T, order_by = distance) %>%
        ___(prediction = mean(___)) %>%
        pull(prediction)
    ) %>% as_vector()
  ) %>% 
  ggplot(aes(x = income, y = approved)) +
  geom_jitter(alpha = 0.2, height = .025, width = 0, size = .5) +
  geom_point(aes(y = prediction), color = "red", alpha = 0.4) +
  labs(
    title = "KNN predictions using income",
    x = "Income",
    y = "Approved"
  )
```

Recall the "income puzzle" from chapter 6. In `approved ~ income`, income seems to predict `approved` well (higher income means a higher probability of being approved). But when more variables are added, the sign switches: a higher income means a lower probability of being approved! Given the plot above, why do you think that was?

::: {.callout-note appearance="simple"}
## Your Answer



:::

## 3) KNN with `FNN::knn()`

Now we'll write a function `predict_knn` that is similar to `predict_lpm` and `predict_logit`: it should take a data set and a value for k (number of nearest neighbors to consider for the prediction), and return a tibble with `approved` and `prediction` according to the KNN model.

```{r fnnknn}
predict_knn <- function(data, k = 5) {
  # KNN is a little computationally expensive. For this reason, we'll
  # randomly sample 1,000 rows to do the whole procedure on.
  set.seed(1234)
  
  data_small <- data %>%
    slice_sample(n = 1000) %>%
    fastDummies::dummy_cols(
      remove_first_dummy = TRUE,
      remove_selected_columns = TRUE
    )

  knn_fit <- FNN::knn(
    train = data_small %>% select(-approved),
    test = data_small %>% select(-approved),
    cl = data_small %>% pull(approved) %>% factor(),
    k = k,
    prob = TRUE
  )

  data_small %>%
    mutate(
      knn = knn_fit,
      prob_attr = attr(knn_fit, "prob"),
      prediction = if_else(knn == "1", prob_attr, 1 - prob_attr)
    ) %>%
    select(approved, prediction)
}
```
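
`FNN::knn()` needs purely numeric inputs, which is why `predict_knn` first runs the data through `fastDummies::dummy_cols()`. A quick sketch of what that call does on made-up data (the toy tibble and its column values are invented for illustration):

```{r dummy-demo}
# dummy_cols() replaces each factor with 0/1 indicator columns.
# remove_first_dummy = TRUE drops one level per factor (the baseline),
# and remove_selected_columns = TRUE drops the original factor column.
tibble(
  income = c(50, 80, 65),
  race = factor(c("White", "Black", "Asian"))
) %>%
  fastDummies::dummy_cols(
    remove_first_dummy = TRUE,
    remove_selected_columns = TRUE
  )
# Leaves income plus indicator columns race_Black and race_White
# (the first factor level, "Asian", is dropped as the baseline).
```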

## 4) KNN ROC curve and AUC

Fill in the blanks:

- When k = 50, the AUC is: ____
- When k = 20, the AUC is: ____
- When k = 10, the AUC is: ____
- When k = 5, the AUC is: ____
- When k = 3, the AUC is: ____
- When k = 2, the AUC is: ____
- When k = 1, the AUC is: ____

```{r}
# KNN
mortgage %>%
  ___(k = 50) %>%
  ___() %>%
  plot_and_continue(x = FPR, y = TPR, title = "KNN ROC Curve") %>%
  auc()
```

::: {.callout-note appearance="simple"}
## Interpret

- What does it mean when k = 50 vs k = 5 vs k = 1?


- What does a higher AUC mean?


- What happens to the AUC when k decreases?


- When k = 1, the AUC is ___. Does that mean we've reached our goal of uncovering the model that mortgage companies use to approve or deny applicants? Is the model "learning", or is it just memorizing a data set? What do you think would happen if we gave it new data it has never seen before?



:::
