---
title: "17 Fairness Metrics"
format:
  html:
    self-contained: TRUE
---

A human loan officer can discriminate, consciously or not, treating applicants differently based on race, gender, or zip code. A machine learning model can do the same thing, absorbing patterns of past discrimination from historical data and reproducing them at scale. The key difference is that a model is always consistent: it applies the same rule to every applicant, every time, which makes its behavior much easier to study and measure.

**Fairness metrics** are the tools we use to do that measuring. They translate vague intuitions like "the model should treat everyone equally" into precise, testable conditions.

## Part 1: Defining Fairness

**Q1.** The model approves 70% of White applicants and 50% of Black applicants. Just looking at those numbers, what might you conclude?

**Answer here.**

**Q2.** Now you learn that Black applicants in this data set have lower average incomes and credit scores than White applicants. Does that make the approval rate gap more acceptable, less acceptable, or does it not change your view? Explain your reasoning.

**Answer here.**

**Q3.** Where do those income and credit score gaps come from? If they are partly the result of a history of discriminatory hiring, housing, and lending policy, does that change your answer to Q2?

**Answer here.**

**Q4.** Suppose we decided that a fair model is simply one where the approval rate is the same across racial groups, no matter what. Write that idea as a precise condition using math or code. This is Fairness Criterion 1: **Independence**, also known as demographic parity.

**Answer here.**

**Q5.** Now think about a specific type of applicant: someone who would have repaid their loan on time if approved, but the model rejects them anyway. This kind of mistake is a (true/false) (positive/negative). Who is harmed the most by this kind of mistake?

**Answer here.**

**Q6.** Suppose the mistake from Q5 happens twice as often for Black applicants who would have repaid as for White applicants who would have repaid. Is this fair?

**Answer here.**

**Q7.** Write a precise condition to check whether the mistake from Q5 is happening at equal rates across racial groups. Express it using math or code.

**Answer here.**

**Q8.** In the mortgage application problem, what does a false positive look like? Should the rates of false positives be equal across racial groups, and why?

**Answer here.**

Together, Q7 and Q8 define **Separation**, also known as equalized odds: the model should make each type of error at the same rate across groups, once we account for who truly would have repaid.

To check Separation, we need to know the true outcome: did the applicant actually repay their loan or not? The mortgage data set doesn't give us that. From here we switch to the recidivism data set, where we do have the true outcome for every defendant.

```{r, eval = F}
install.packages("fairmodels")
```

```{r, echo = F, message = F, warning = F}
# Here we'll load the data, split into training and test
# sets, and fit a random forest model.

library(tidyverse)
library(randomForest)
library(fairmodels)

data("compas")

crime <- as_tibble(compas) %>%
  rename(
    reoffended  = Two_yr_Recidivism,
    priors      = Number_of_Priors,
    over_45     = Age_Above_FourtyFive,
    under_25    = Age_Below_TwentyFive,
    misdemeanor = Misdemeanor,
    race        = Ethnicity,
    sex         = Sex
  ) %>%
  mutate(
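    # as.double() on a factor returns the level index (1, 2), not the label.
    # Assuming the levels are ordered "0", "1", index 1 means "did not reoffend".
    reoffended = if_else(as.double(reoffended) == 1, 0, 1),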
    race = fct_relevel(race, "Caucasian"),
    sex  = factor(sex, levels = c("Male", "Female"))
  )

set.seed(1234)
crime <- crime %>%
  mutate(train = sample(0:1, nrow(.), replace = TRUE, prob = c(0.2, 0.8)))

crime_train <- crime %>% filter(train == 1)
crime_test  <- crime %>% filter(train == 0)

crime_rf <- randomForest(
  factor(reoffended) ~ . - train,
  data  = crime_train,
  ntree = 100,
  mtry  = 3
)

crime_predictions <- crime_test %>%
  mutate(
    prediction = predict(crime_rf, newdata = crime_test, type = "prob")[, 2],
    classifier = if_else(prediction >= 0.5, 1, 0)
  ) %>%
  select(-train)
```

The COMPAS algorithm was used by courts across the United States to predict whether a defendant would reoffend within two years. Judges used these predictions to inform decisions about bail, sentencing, and parole. In 2016, the investigative journalism organization ProPublica analyzed COMPAS scores and actual reoffending records for defendants in Broward County, Florida.

This data set is unusual and valuable: it contains the **true outcome**, whether each defendant actually reoffended within two years. That lets us compare a model's predictions with what really happened, and then check whether its mistakes are spread fairly across racial groups.

## Part 2: Does the Model Satisfy Independence?

Recall that **Independence** requires the flagging rate (the share of defendants the model predicts will reoffend) to be the same across racial groups.

**Q9.** Use `crime_predictions` to evaluate whether independence holds. Are flagging rates the same across racial groups?

```{r}
crime_predictions %>%
  ____
```

Looking at these values tells us whether the rates differ, but not whether those differences are large enough to rule out chance. We can use a linear regression to test whether group means are really different.

```{r}
lm(classifier ~ race, data = crime_predictions) %>%
  broom::tidy()
```
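
One way to read output like this: with a 0/1 outcome and a single categorical predictor, the intercept from `lm()` is the outcome's mean in the reference group, and each remaining coefficient is the gap between that group's mean and the reference group's mean. The sketch below checks this on a small made-up data set. The `toy` tibble, its groups `A` and `B`, and its values are hypothetical and are not taken from the COMPAS data.

```{r}
library(dplyr)

# Hypothetical toy data: 10 people in group A, 10 in group B.
toy <- tibble(
  group   = factor(rep(c("A", "B"), each = 10)),
  flagged = c(rep(1, 3), rep(0, 7),   # group A: 3 of 10 flagged
              rep(1, 6), rep(0, 4))   # group B: 6 of 10 flagged
)

# Flagging rate in each group, computed directly.
toy_rates <- toy %>%
  group_by(group) %>%
  summarize(rate = mean(flagged))

# The same comparison as a regression. Group A is the reference level.
toy_fit <- lm(flagged ~ group, data = toy)

# (Intercept) is group A's rate; groupB is group B's rate minus group A's.
coef(toy_fit)
```

Here the intercept is 0.3 (group A's rate) and the `groupB` coefficient is 0.3 (the gap between 0.6 and 0.3). The p-value on a group coefficient tests whether that gap could plausibly be zero. In the regression on `crime_predictions`, the reference group is Caucasian because of the `fct_relevel(race, "Caucasian")` call in the setup chunk.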

**Q10.** Interpret what the regression output says about independence.

**Answer here.**

The **80% rule** says that if one group's rate of receiving the favorable outcome (here, being predicted not to reoffend) falls below 80% of the highest group's rate, that signals a potential problem. It's used in U.S. employment law, and it's a great example of a policy choice masquerading as a statistical one: the cutoff lets everyone pretend there's an objective answer. Still, it highlights the difference between statistical significance and economic significance: with enough data, the gap between two rates such as 80% and 81% can be statistically significant, but it's probably too small a difference to be meaningful.
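
As a worked example of the arithmetic, here is a minimal sketch with made-up numbers (the group names and rates are hypothetical). It divides each group's favorable-outcome rate (in this setting, the rate of being predicted not to reoffend) by the highest group's rate, then compares the ratio to 0.8.

```{r}
library(dplyr)

# Hypothetical favorable-outcome rates for two groups.
toy_rule <- tibble(
  group          = c("A", "B"),
  favorable_rate = c(0.60, 0.45)
) %>%
  mutate(
    # Each group's rate relative to the best-off group's rate.
    ratio       = favorable_rate / max(favorable_rate),
    below_80pct = ratio < 0.8
  )

toy_rule
```

Group B's ratio is 0.45 / 0.60 = 0.75, which is below 0.8, so the rule would flag this gap as a potential problem.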

**Q11.** Does the rate of being predicted *not* to reoffend for African Americans fall below 80% of the rate for Caucasians?

```{r}
crime_predictions %>%
  ____
```

## Part 3: Does the Model Satisfy Separation?

Recall that **Separation** requires two conditions to hold simultaneously across racial groups: the false positive rate must be equal, and the false negative rate must be equal.
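
In symbols, the two conditions can be sketched as follows. The notation is an assumption introduced here for illustration: $\hat{Y}$ is the model's 0/1 prediction, $Y$ is the true outcome, and $a$ and $b$ are any two racial groups.

$$
\begin{aligned}
\text{Equal FPR:}\quad & P(\hat{Y} = 1 \mid Y = 0,\ \text{race} = a) = P(\hat{Y} = 1 \mid Y = 0,\ \text{race} = b) \\
\text{Equal FNR:}\quad & P(\hat{Y} = 0 \mid Y = 1,\ \text{race} = a) = P(\hat{Y} = 0 \mid Y = 1,\ \text{race} = b)
\end{aligned}
$$

Both probabilities condition on the true outcome $Y$, which is why Separation can only be checked in data where the true outcome is observed.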

**Q12.** Interpret the differences in FPR and FNR between Caucasians and African Americans. What do they mean, and who is hurt by each kind of mistake?

```{r}
crime_predictions %>%
  group_by(race) %>%
  summarize(
    FPR = sum(reoffended == 0 & classifier == 1) / sum(reoffended == 0),
    FNR = sum(reoffended == 1 & classifier == 0) / sum(reoffended == 1),
    n = n()
  )
```

**Q13.** Let's conduct formal hypothesis tests using `lm()` to see whether the FPR and FNR are equal across groups. Do these tests point to separation being violated here? Explain.

```{r}
# FPR regression: among those who did not reoffend, 
# does race predict being wrongly flagged?
crime_predictions %>%
  filter(reoffended == 0) %>%
  lm(classifier ~ race, data = .) %>%
  broom::tidy()

# FNR regression: among those who did reoffend, 
# does race predict being missed?
crime_predictions %>%
  filter(reoffended == 1) %>%
  mutate(missed = 1 - classifier) %>%
  lm(missed ~ race, data = .) %>%
  broom::tidy()
```

ProPublica's critique of COMPAS focused on the FPR being roughly twice as high for Black defendants as for White defendants, which is a violation of Separation. The company that built COMPAS responded that their model satisfied a third fairness criterion: *Calibration*.

**Calibration** asks whether the same predicted score means the same thing across groups. If the model gives a defendant a 70% chance of reoffending, that probability should be equally accurate for Black and White defendants. We can check for calibration by asking: once we control for the model's predicted score, does race still predict who actually reoffended?

```{r}
lm(reoffended ~ prediction + race, data = crime_predictions) %>%
  broom::tidy()
```
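
To see what this regression looks like when calibration does hold, here is a minimal simulation sketch. Everything in it is hypothetical: the `sim` tibble, the groups `A` and `B`, and the Beta distributions used to generate scores. The two groups get very different scores on average, but outcomes are generated so that a score of `s` means a probability `s` of the outcome in both groups.

```{r}
library(dplyr)

set.seed(42)
n <- 20000

# Hypothetical simulated data, calibrated by construction:
# group B tends to get higher scores than group A, but in both groups
# the outcome occurs with probability equal to the score.
sim <- tibble(
  group   = factor(sample(c("A", "B"), n, replace = TRUE)),
  score   = if_else(group == "A", rbeta(n, 2, 4), rbeta(n, 4, 2)),
  outcome = rbinom(n, size = 1, prob = score)
)

# Same form as the calibration check: does group still predict the
# outcome once we control for the score?
sim_fit <- lm(outcome ~ score + group, data = sim)

coef(sim_fit)
```

Because the outcomes were simulated to be calibrated, the coefficient on `groupB` should land near zero and the coefficient on `score` near one, even though the groups' average scores differ a lot. A group coefficient clearly away from zero in a regression like this would suggest that the same score means something different across groups. This linear check is only a rough test of calibration, not a complete one.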

**Q14.** Interpret the coefficient from the regression above on `raceAfrican_American`. It says, once we control for the model's prediction, race (does/does not) have statistically significant predictive power over whether the person will reoffend.

**Answer here.**

Models tend to come close to satisfying calibration without any special effort, because it follows from what they are optimized to do: minimize prediction error across all observations. If a well-trained model assigns a 70% reoffending probability to a set of defendants, about 70% of them should go on to reoffend, regardless of race, simply because the model is trying to be as accurate as possible for everyone.

## Part 4: Practice Problems

**Q15.** A tech company uses a model to screen job applications. The model advances 40% of male applicants and 25% of female applicants to the interview stage.

a. Which fairness criterion is certainly being violated here?

**Answer here.**

b. Apply the 80% rule. Does the gap signal a potential problem?

**Answer here.**

c. The company argues that male applicants in their data set have more years of relevant experience on average. Does that settle the question of fairness?

**Answer here.**

**Q16.** A model screens patients for a disease. Here are the results:

| | Group A | Group B |
|--|--|--|
| Truly have disease ($Y=1$) | 500 | 500 |
| Correctly identified | 400 | 300 |
| Truly healthy ($Y=0$) | 2000 | 2000 |
| Wrongly flagged | 200 | 100 |

a. Compute the FPR (wrongly flagged / truly healthy) and FNR (missed / truly have the disease) for each group.

**Answer here.**

b. Does the model satisfy Separation? Which group bears the higher cost, and what is that cost?

**Answer here.**

## Download this assignment

Here's a [link](downloads/CW17-fairness-metrics.qmd) to download this assignment.