We will discuss how to use denial_reason in our model: can it create data leakage?
We’ll try to build the highest-performing model possible while avoiding the use of protected demographic variables.
And finally, we’ll start thinking about how to evaluate the model for discrimination and fairness by considering how it might rely on proxy variables for protected demographic variables like race and ethnicity.
denial_reason
So far, we haven’t used the variable denial_reason.
denial_reason is the reason the lender cited for denying the mortgage application. It’s a little dangerous to put in our model because it could create data leakage: we should avoid training on variables that are not yet available at the time the prediction is made. Using these kinds of variables can make your model appear highly accurate during testing but perform poorly in practice.
But let’s look closer at what denial_reason tells us: the primary reason for denying the application was…
1: Debt-to-income ratio
2: Employment history
3: Credit history
4: Collateral
5: Insufficient cash (downpayment, closing costs)
6: Unverifiable information
7: Credit application incomplete
8: Mortgage insurance denied
9: Other
10: Not applicable (the application was approved)
The code 3: Credit history is interesting: up until now, we haven’t been able to include anything about the applicant’s credit history. So while putting denial_reason into the model directly may create data leakage, let’s use it to guide the creation of some variables that would certainly be available at the time of prediction.
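To see the leakage concretely, notice that code 10 (Not applicable) is recorded exactly when the application was approved, so the raw field encodes the very outcome we want to predict. Here is a minimal sketch of that check; the file name hmda.csv, the column names denial_reason and approved, and the numeric coding of the reasons are all assumptions about how your extract is labeled.

```python
import pandas as pd

# Hypothetical file name; point this at your own HMDA extract.
df = pd.read_csv("hmda.csv")

# If the raw field were fair game, it would be a near-perfect predictor:
# denial_reason == 10 ("Not applicable") should line up exactly with approvals,
# so this cross-tab shows the outcome is essentially determined by the field.
print(pd.crosstab(df["denial_reason"] == 10, df["approved"]))
```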
Question #1
Use denial_reason to guide your feature engineering. In particular, create variables related to:
poor credit history
identifying an appropriate debt-to-income threshold
insufficient collateral (cases where the property appraisal may have been too low)
incomplete applications or missing documentation
All these variables would be easily observable by the algorithm at the time of prediction, so we should be OK when it comes to data leakage.
Then experiment with logs, squares, interactions, buckets, as well as these 4 new variables derived from denial_reason. Exclude protected demographic variables (age, sex, race, and ethnicity). Evaluate the LPM, Logit, and KNN using cross-validated AUC. What’s the highest CV AUC you can achieve?
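Here is a minimal sketch of one way to set this up with scikit-learn, continuing from the data-loading sketch above. The column names, the numeric denial codes, and the 43% / 95% thresholds are assumptions to adapt to your data; treat the feature choices as a starting point, not the answer.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.metrics import roc_auc_score

# `df` continues from the loading sketch above.
# Four engineered indicators guided by denial_reason. The numeric codes are assumed
# to match the list above (3 = credit history, 7 = incomplete application), and the
# ratio columns are assumed to be numeric and reported in percent. In deployment,
# these flags would come from the application and credit report themselves.
df["bad_credit"]     = (df["denial_reason"] == 3).astype(int)
df["high_dti"]       = (df["debt_to_income_ratio"] > 43).astype(int)
df["low_collateral"] = (df["loan_to_value_ratio"] > 95).astype(int)
df["incomplete_app"] = (df["denial_reason"] == 7).astype(int)

# A few standard transformations: logs, squares, ratios.
df["log_income"]      = np.log1p(df["income"].clip(lower=0))
df["log_loan_amount"] = np.log1p(df["loan_amount"].clip(lower=0))
df["loan_to_income"]  = df["loan_amount"] / df["income"].replace(0, np.nan)
df["log_income_sq"]   = df["log_income"] ** 2

features = ["log_income", "log_loan_amount", "loan_to_income", "log_income_sq",
            "bad_credit", "high_dti", "low_collateral", "incomplete_app"]
X = df[features].fillna(0)
y = df["approved"]          # assumed to be coded 0/1

cv = KFold(n_splits=5, shuffle=True, random_state=0)

def cv_auc(model, needs_proba):
    """Cross-validated AUC. The LPM has no predict_proba, so score its raw fitted values."""
    if needs_proba:
        pred = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    else:
        pred = cross_val_predict(model, X, y, cv=cv)
    return roc_auc_score(y, pred)

print("LPM  :", cv_auc(LinearRegression(), needs_proba=False))
print("Logit:", cv_auc(make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)), needs_proba=True))
print("KNN  :", cv_auc(make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=25)), needs_proba=True))
```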
Copy-paste the highest CV AUC you were able to achieve with each of the 3 models:
lpm: ____________    logit: ____________    knn: ____________
Hidden Proxies and Fairness
Let’s consider why this data has been made public in the first place.
Mortgage lenders in the United States are required to report detailed information about mortgage applications under the Home Mortgage Disclosure Act (HMDA). This data helps regulators monitor whether lenders are meeting community housing needs, identify potential discriminatory lending patterns, and support public policy decisions related to housing and credit access.
When mortgage lenders publish this data, they are hoping that we will find that their decisions are based only on financial risk and do not depend in any way on whether the applicant is a member of a protected class (defined by race, ethnicity, sex, or age).
One piece of evidence in their favor would be: If we add protected demographic variables to a prediction model, the model should not become meaningfully better at matching the lender’s decisions.
Question #2
Include Protected Demographics
Take your model from the previous section and add in race, ethnicity, sex, and age. Do the predictions get a lot better? (Yes/No), which (does/does not) prove that lenders are discriminating.
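A minimal sketch of the comparison, continuing from the Question #1 sketch (the protected column names race, ethnicity, sex, and age are assumptions about this extract):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.metrics import roc_auc_score

# Assumed names for the protected demographic columns.
protected = ["race", "ethnicity", "sex", "age"]

# `df`, `features`, and `y` continue from the Question #1 sketch.
X_base = df[features].fillna(0)
X_plus = pd.concat([X_base, pd.get_dummies(df[protected].astype(str))], axis=1)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for label, X_try in [("without protected", X_base), ("with protected", X_plus)]:
    prob = cross_val_predict(LogisticRegression(max_iter=1000), X_try, y,
                             cv=cv, method="predict_proba")[:, 1]
    print(label, "CV AUC:", round(roc_auc_score(y, prob), 3))
```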
lpm: ____________    logit: ____________    knn: ____________
Hidden Proxies
But even if protected variables are excluded, models may still recover similar information indirectly through other variables. These are called proxy variables.
A proxy variable is a variable that is strongly correlated with a protected characteristic, so a model that uses the proxy can recover much of the protected information even though the protected variable itself was excluded.
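One quick, informal check for a potential proxy is to compare the candidate variable’s distribution across groups. A minimal sketch, assuming the column names used elsewhere in this section:

```python
# If the group means differ sharply, the tract variable carries information about race.
print(df.groupby("race")["tract_minority_population_percent"].mean())
```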
Question #3
Include Census Tract Variables
Take your model from the previous section. Remove race, ethnicity, sex, and age, and add in all variables that start with “tract”. Do the predictions get a lot better? (Yes/No), which (does/does not) prove that lenders are discriminating through these proxies.
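A minimal sketch for pulling in every tract-level column by its name prefix, continuing the objects from the earlier sketches (and assuming the tract_ columns are numeric):

```python
# Select every census-tract variable by its name prefix, add it to the
# protected-variable-free feature matrix, and rerun the same CV AUC comparison.
tract_cols = [c for c in df.columns if c.startswith("tract")]
X_tract = pd.concat([X_base, df[tract_cols]], axis=1).fillna(0)

prob = cross_val_predict(LogisticRegression(max_iter=1000), X_tract, y,
                         cv=cv, method="predict_proba")[:, 1]
print("with tract variables, CV AUC:", round(roc_auc_score(y, prob), 3))
```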
lpm: ____________    logit: ____________    knn: ____________
tract_minority_population_percent might be a proxy for (race/ethnicity/age/sex) because ______.
tract_owner_occupied_units might be a proxy for (race/ethnicity/age/sex) because ______.
tract_median_age_of_housing_units might be a proxy for (race/ethnicity/age/sex) because ______.
Next, consider: what else could be a proxy?
The tract variables are not the only possible proxies in this data set. Suppose we remove race and ethnicity from the model. Could we still predict a person’s race or ethnicity using the remaining variables? If the answer is yes, then those remaining variables still contain demographic information. Let’s investigate.
Question #4
Draw plots exploring how all the variables below differ across racial and ethnic groups. Which seem especially correlated with race or ethnicity? (A plotting sketch follows the list.)
income
loan_amount
property_value
loan_type
loan_purpose
occupancy_type
debt_to_income_ratio
loan_to_value_ratio
bad credit
high debt_to_income_ratio
insufficient collateral
incomplete credit applications
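To get started, here is a minimal plotting sketch using pandas and matplotlib; the race column name is an assumption, income and loan_type stand in for whichever variable you are examining, and you would repeat this for each variable in the list above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Numeric variables: compare distributions across racial groups with boxplots.
df.boxplot(column="income", by="race", rot=45)
plt.suptitle("")                       # drop the automatic "grouped by" banner
plt.title("income by racial group")
plt.show()

# Categorical variables: compare within-group shares instead.
shares = pd.crosstab(df["race"], df["loan_type"], normalize="index")
shares.plot(kind="bar", stacked=True)
plt.ylabel("share of applications within group")
plt.show()
```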
Question #5: Can we predict race or ethnicity?
Build a prediction model where the dependent variable is race or ethnicity instead of approved. What variables seem to be good proxies for race and ethnicity, and how high of a CV AUC can you get?
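Here is a minimal sketch of one version of this check, framed as a binary question (is the applicant in a particular racial group?). The group label, the column names, and the list of columns to drop are assumptions to adapt to your data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.metrics import roc_auc_score

# Binary target: membership in one racial group (the label string is an assumption).
y_race = (df["race"] == "Black or African American").astype(int)

# Predictors: everything except the protected variables, the outcome, and the
# leaky denial field; one-hot encode whatever is categorical.
drop_cols = ["race", "ethnicity", "sex", "age", "approved", "denial_reason"]
X_rest = pd.get_dummies(df.drop(columns=drop_cols), dummy_na=True).fillna(0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = KFold(n_splits=5, shuffle=True, random_state=0)
prob = cross_val_predict(model, X_rest, y_race, cv=cv, method="predict_proba")[:, 1]
print("CV AUC for predicting race:", round(roc_auc_score(y_race, prob), 3))

# With standardized inputs, the largest absolute coefficients flag the strongest proxies.
model.fit(X_rest, y_race)
coefs = pd.Series(model.named_steps["logisticregression"].coef_[0], index=X_rest.columns)
print(coefs.abs().sort_values(ascending=False).head(10))
```

A high AUC here means the supposedly neutral variables jointly encode race or ethnicity, which is exactly the proxy problem described above.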
Conclusion: What might discrimination actually look like?
Suppose a lender never directly uses race in their algorithm. Discrimination can still occur: lenders can use proxy variables and still systematically disadvantage certain groups.
This is one reason fairness in machine learning is difficult: Removing protected variables does not necessarily remove protected information.