17 Fairness Metrics

A human loan officer can discriminate, consciously or not, treating applicants differently based on race, gender, or zip code. A machine learning model can do the same thing, absorbing patterns of past discrimination from historical data and reproducing them at scale. The key difference is that a model is always consistent: it applies the same rule to every applicant, every time, which makes its behavior much easier to study and measure.

Fairness metrics are the tools we use to do that measuring. They translate vague intuitions like “the model should treat everyone equally” into precise, testable conditions.

Part 1: Defining Fairness

Q1. Suppose a mortgage approval model approves 70% of White applicants and 50% of Black applicants. Just looking at those numbers, what might you conclude?

Q2. Now you learn that Black applicants in this data set have lower average incomes and credit scores than White applicants. Does that make the approval rate gap more acceptable, less acceptable, or does it not change your view? Explain your reasoning.

Q3. Where do those income and credit score gaps come from? If they are partly the result of a history of discriminatory hiring, housing, and lending policy, does that change your answer to Q2?

Q4. Suppose we decided that a fair model is simply one where the approval rate is the same across racial groups, no matter what. Write that idea as a precise condition using math or code. This is Fairness Criterion 1: Independence, also known as demographic parity.

Q5. Now think about a specific type of applicant: someone who would have repaid their loan on time if approved, but the model rejects them anyway. This kind of mistake is a (true/false) (positive/negative). Who is harmed the most by this kind of mistake?

Q6. Suppose the mistake from Q5 happens twice as often for Black applicants who would have repaid as for White applicants who would have repaid. Is this fair?

Q7. Write a precise condition to check whether the mistake from Q5 is happening at equal rates across racial groups. Express it using math or code.

Q8. In the mortgage application problem, what does a false positive look like? Should the rates of false positives be equal across racial groups, and why?

Together, Q7 and Q8 define Fairness Criterion 2: Separation, also known as equalized odds. The model should make each type of error at the same rate across groups, once we account for who truly would and would not have repaid.

To check Separation, we need to know the true outcome: did the applicant actually repay their loan or not? The mortgage data set doesn’t give us that. From here we switch to the recidivism data set, where we do have the true outcome for every defendant.

install.packages("fairmodels")

The COMPAS algorithm was used by courts across the United States to predict whether a defendant would reoffend within two years. Judges used these predictions to inform decisions about bail, sentencing, and parole. In 2016, the investigative journalism organization ProPublica analyzed COMPAS scores and actual reoffending records for defendants in Broward County, Florida.

This data set is unusual and valuable: it contains the true outcome, whether each defendant actually reoffended within two years. That lets us compare a model's predictions against what really happened, and check whether its mistakes fall evenly across racial groups.

Part 2: Does the Model Satisfy Independence?

Recall that Independence requires the flagging rate to be the same across racial groups.

Q9. Use crime_predictions to evaluate whether Independence holds. Are flagging rates the same across racial groups?

crime_predictions %>%
  ____

Looking at these values tells us whether the rates differ, but not whether those differences are large enough to rule out chance. We can use a linear regression to test whether group means are really different.

lm(classifier ~ race, data = crime_predictions) %>%
  broom::tidy()
# A tibble: 6 × 5
  term                 estimate std.error statistic  p.value
  <chr>                   <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)            0.245     0.0226    10.8   3.25e-26
2 raceAfrican_American   0.320     0.0293    10.9   1.40e-26
3 raceAsian             -0.245     0.209     -1.18  2.40e- 1
4 raceHispanic           0.0632    0.0502     1.26  2.09e- 1
5 raceNative_American    0.0881    0.269      0.328 7.43e- 1
6 raceOther             -0.0867    0.0560    -1.55  1.22e- 1
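To see how to read a regression of a 0/1 outcome on a group variable, here is a minimal sketch in base R. The data frame `toy` and its columns `group` and `flag` are made up for illustration and have nothing to do with the COMPAS data. The point is that the intercept is the mean of the reference group, and each group coefficient is that group's mean minus the reference group's mean.

```r
# Toy data (hypothetical, invented for illustration): a 0/1 flag and two groups.
toy <- data.frame(
  group = factor(c("A", "A", "A", "A", "B", "B", "B", "B")),
  flag  = c(1, 0, 0, 0, 1, 1, 1, 0)
)

# Group means computed directly: the share flagged in each group.
group_means <- tapply(toy$flag, toy$group, mean)
group_means

# Regress the flag on group. "A" is the reference level because it
# comes first alphabetically.
fit <- lm(flag ~ group, data = toy)
coef(fit)

# The intercept matches group A's mean, and the groupB coefficient
# matches the difference in means (B minus A).
diff_in_means <- unname(group_means["B"] - group_means["A"])
diff_in_means
```

In this toy example the intercept is 0.25 (group A's flagging rate) and the `groupB` coefficient is 0.5 (0.75 minus 0.25). The p-value on a group coefficient therefore tests whether that group's rate differs from the reference group's rate.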

Q10. Interpret what the regression output says about Independence.

The 80% rule (also called the four-fifths rule) says that if one group’s rate of receiving the favorable outcome, here being predicted not to reoffend, falls below 80% of the highest group’s rate, that signals a potential problem. It’s used in U.S. employment law, and it is a great example of a policy choice masquerading as a statistical one: a fixed cutoff lets everyone pretend there’s an objective answer. Still, it highlights the difference between statistical significance and economic significance: a gap as small as 80% versus 81% can be statistically significant if there’s enough data, but it’s probably too small a difference to be meaningful.
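The arithmetic behind the 80% rule can be sketched in a few lines of base R. The rates and group labels below are invented for illustration and are not taken from the COMPAS data. Each value is a group's favorable-outcome rate, meaning the share predicted not to reoffend.

```r
# Hypothetical favorable-outcome rates (share of each group predicted
# NOT to reoffend). These numbers are made up for illustration only.
favorable_rate <- c(group_a = 0.60, group_b = 0.45, group_c = 0.50)

# Ratio of each group's rate to the highest group's rate.
impact_ratio <- favorable_rate / max(favorable_rate)

# A group is flagged when its ratio falls below the 80% threshold.
below_threshold <- impact_ratio < 0.80

data.frame(favorable_rate, impact_ratio, below_threshold)
```

Here only `group_b` is flagged: its rate is about 75% of the highest group's rate, while `group_c` sits at about 83%. Note that the comparison uses the favorable outcome; applying the same ratio to flagging rates would answer a different question.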

Q11. Apply the 80% rule to African Americans and Caucasians: does one group’s rate of being predicted not to reoffend fall below 80% of the other’s?

crime_predictions %>%
  ____

Part 3: Does the Model Satisfy Separation?

Recall that Separation requires two conditions to hold simultaneously across racial groups: the false positive rate must be equal, and the false negative rate must be equal.
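As a sketch, the two conditions can be written in probability notation. The symbols are assumptions introduced here for illustration: \(Y\) is the true outcome (1 = reoffended), \(\hat{Y}\) is the model's flag (1 = predicted to reoffend), and \(A\) is the racial group. For any two groups \(a\) and \(b\):

\[
P(\hat{Y} = 1 \mid Y = 0, A = a) = P(\hat{Y} = 1 \mid Y = 0, A = b) \quad \text{(equal false positive rates)}
\]

\[
P(\hat{Y} = 0 \mid Y = 1, A = a) = P(\hat{Y} = 0 \mid Y = 1, A = b) \quad \text{(equal false negative rates)}
\]

Each rate conditions on the true outcome first, which is why Separation cannot be checked without knowing who actually reoffended.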

Q12. Interpret the differences in FPR and FNR between Caucasians and African Americans. What do they mean, and who is hurt by each kind of mistake?

crime_predictions %>%
  group_by(race) %>%
  summarize(
    FPR = sum(reoffended == 0 & classifier == 1) / sum(reoffended == 0),
    FNR = sum(reoffended == 1 & classifier == 0) / sum(reoffended == 1),
    n = n()
  )
# A tibble: 6 × 4
  race               FPR   FNR     n
  <fct>            <dbl> <dbl> <int>
1 Caucasian        0.152 0.596   420
2 African_American 0.389 0.279   623
3 Asian            0     1         5
4 Hispanic         0.227 0.561   107
5 Native_American  0.5   1         3
6 Other            0.08  0.719    82

Q13. Let’s conduct formal hypothesis tests using lm() to see if the FPR and FNR are equal across groups. Do these tests point to Separation being violated here? Explain.

# FPR regression: among those who did not reoffend, 
# does race predict being wrongly flagged?
crime_predictions %>%
  filter(reoffended == 0) %>%
  lm(classifier ~ race, data = .) %>%
  broom::tidy()
# A tibble: 6 × 5
  term                 estimate std.error statistic  p.value
  <chr>                   <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)            0.152     0.0259     5.84  8.06e- 9
2 raceAfrican_American   0.238     0.0358     6.64  6.36e-11
3 raceAsian             -0.152     0.245     -0.619 5.36e- 1
4 raceHispanic           0.0758    0.0580     1.31  1.92e- 1
5 raceNative_American    0.348     0.299      1.16  2.44e- 1
6 raceOther             -0.0715    0.0650    -1.10  2.72e- 1
# FNR regression: among those who did reoffend, 
# does race predict being missed?
crime_predictions %>%
  filter(reoffended == 1) %>%
  mutate(missed = 1 - classifier) %>%
  lm(missed ~ race, data = .) %>%
  broom::tidy()
# A tibble: 6 × 5
  term                 estimate std.error statistic  p.value
  <chr>                   <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)            0.596     0.0373    16.0   1.18e-47
2 raceAfrican_American  -0.317     0.0452    -7.02  6.62e-12
3 raceAsian              0.404     0.331      1.22  2.23e- 1
4 raceHispanic          -0.0352    0.0817    -0.431 6.67e- 1
5 raceNative_American    0.404     0.467      0.865 3.88e- 1
6 raceOther              0.123     0.0903     1.36  1.75e- 1

ProPublica’s critique of COMPAS focused on the FPR being roughly twice as high for Black defendants, which is a violation of Separation. The company that built COMPAS responded that their model satisfied a third fairness criterion: Calibration.

Calibration asks whether the same predicted score means the same thing across groups. If the model gives a defendant a 70% chance of reoffending, that probability should be equally accurate for Black and White defendants. We can check for calibration by asking: once we control for the model’s predicted score, does race still predict who actually reoffended?

lm(reoffended ~ prediction + race, data = crime_predictions) %>%
  broom::tidy()
# A tibble: 7 × 5
  term                 estimate std.error statistic  p.value
  <chr>                   <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)           0.260      0.0242   10.7    8.78e-26
2 prediction            0.461      0.0360   12.8    2.93e-35
3 raceAfrican_American  0.0269     0.0310    0.867  3.86e- 1
4 raceAsian             0.0339     0.208     0.162  8.71e- 1
5 raceHispanic         -0.00333    0.0502   -0.0663 9.47e- 1
6 raceNative_American  -0.163      0.269    -0.608  5.43e- 1
7 raceOther             0.0371     0.0560    0.663  5.07e- 1
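The calibration condition described above can also be written formally. As a sketch, with notation assumed here for illustration, let \(Y\) be the true outcome (1 = reoffended), \(S\) the model's predicted score, and \(A\) the racial group. Calibration across groups requires, for every score \(s\) and any two groups \(a\) and \(b\):

\[
P(Y = 1 \mid S = s, A = a) = P(Y = 1 \mid S = s, A = b)
\]

A regression of the true outcome on the score and race is a linear check of this condition: if it holds, race should add little once the score is included.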

Q14. Interpret the coefficient from the regression above on raceAfrican_American. It says, once we control for the model’s prediction, race (does/does not) have statistically significant predictive power over whether the person will reoffend.

Models tend to satisfy calibration automatically because it is what they are optimized to do: minimize prediction error across all observations. A well-trained model that assigns a 70% reoffending probability to a defendant should be right about 70% of the time regardless of race, simply because the model is trying to be as accurate as possible for everyone.

Part 4: Practice Problems

Q15. A tech company uses a model to screen job applications. The model advances 40% of male applicants and 25% of female applicants to the interview stage.

  1. Which fairness criterion is certainly being violated here?

  2. Apply the 80% rule. Does the gap signal a potential problem?

  3. The company argues that male applicants in their data set have more years of relevant experience on average. Does that settle the question of fairness?

Q16. A model screens patients for a disease. Here are the results:

                                 Group A   Group B
Truly have disease (\(Y=1\))         500       500
Correctly identified                 400       300
Truly healthy (\(Y=0\))             2000      2000
Wrongly flagged                      200       100

  1. Compute the FPR (wrongly flagged / truly healthy) and FNR (missed / truly have the disease) for each group.

  2. Does the model satisfy Separation? Which group bears the higher cost, and what is that cost?
