15 Random Forests
Today, we’ll need two new packages: tree and randomForest:
install.packages("tree")
install.packages("randomForest")
library(tree)
library(randomForest)
Load the mortgage data:
Part 1: Using tree to fit a decision tree
The tree() function grows a decision tree by repeatedly searching for splits in the data that best separate the outcome variable. tree() takes as arguments:
- A formula. We’ll use approved ~ . to say that approved is our dependent variable, and we want to use all other variables in the data set as explanatory variables.
- A data set. We’ll train the decision tree on mortgage_train, and then test its accuracy on mortgage_test.
- A splitting criterion split. We’ll use "gini" impurity.
- A control list, which lets us specify:
  - nobs: the number of observations in the training set
  - mincut: a stopping criterion for the minimum number of observations to include in either child node. The default is 5.
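To make the "gini" criterion concrete, here is a minimal base R sketch of how Gini impurity can score a candidate split. The helpers gini() and split_impurity() and the toy vectors are illustrative assumptions, not part of the tree package, and tree() may compute its criterion differently in detail.

```r
# Gini impurity of a set of class labels: 1 - sum(p_k^2).
# 0 means the node is pure; 0.5 is the maximum for two classes.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted impurity of the two child nodes produced by a candidate split.
split_impurity <- function(labels, goes_left) {
  n <- length(labels)
  left <- labels[goes_left]
  right <- labels[!goes_left]
  (length(left) / n) * gini(left) + (length(right) / n) * gini(right)
}

# Toy data (hypothetical): 8 applicants, approval status, income in $1000s.
approved <- c(1, 1, 1, 1, 0, 0, 0, 1)
income   <- c(90, 85, 70, 65, 30, 35, 40, 45)

parent_gini <- gini(approved)                         # impurity before splitting
good_split  <- split_impurity(approved, income > 42)  # separates the classes fully
poor_split  <- split_impurity(approved, income > 80)  # leaves one mixed child
c(parent = parent_gini, good = good_split, poor = poor_split)
```

A split search prefers the candidate with the lowest weighted impurity, so in this toy example income > 42 would be preferred over income > 80.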
Play with mincut and run the code below many times.
When you hold mincut steady, does the tree change in structure a lot or a little? Are decision trees high or low variance models?
What values can mincut take on without giving you an error? How does the structure of the decision tree change when you increase or decrease mincut?
How does the test accuracy of the decision tree change when you change mincut?
mortgage_train <- mortgage %>% slice_sample(n = 1000)
tree(
  formula = factor(approved) ~ .,
  data = mortgage_train,
  split = "gini",
  control = tree.control(
    nobs = nrow(mortgage_train),
    mincut = ____
  )
) %>%
  {
    # Plot the tree
    plot(.)
    text(., pretty = 0)
    # Assess the test accuracy of the tree
    predictions <- predict(., newdata = mortgage, type = "class")
    mean(predictions == mortgage$approved)
  }
Decision trees are powerful prediction models because they are easy to interpret, flexible, and capture nonlinear relationships and interactions automatically. But the issue with decision trees is that they are high variance models: different training sets lead to very different decision trees. It’s hard to tell which tree we should trust, especially when they all lead to similar test accuracy.
To combat the variance of the decision tree, we have 2 tools:
- Bagging (“bootstrap aggregating”)
- Random Forests
We’ll explore each of these in turn.
Part 2: Using randomForest to do bagging
Decision trees have high variance because small changes in the training data can lead to very different trees. Bagging helps solve this problem.
Bagging works by:
- Taking many bootstrap samples of the training data (explanation below)
- Fitting a decision tree to each bootstrap sample
- Averaging the predictions across all trees (for a classification problem like ours, taking a majority vote)
Instead of trusting one unstable tree, bagging combines many unstable trees into one more stable prediction model. The benefit is that variance goes down, but it comes at a cost: when you average across many decision trees, you only get predictions, not a single decision tree that you can plot and interpret.
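The three bagging steps can be sketched by hand. This is a minimal illustration under stated assumptions, not how randomForest() is implemented: it uses rpart (which ships with R) as a stand-in for tree(), and the data set sim, its variables, and the value of B are all made up for the example.

```r
library(rpart)  # assumption: rpart stands in for tree() in this sketch

set.seed(42)  # arbitrary seed for reproducibility

# Simulated stand-in for the mortgage data (hypothetical variables).
n <- 600
sim <- data.frame(
  income = rnorm(n, mean = 60, sd = 15),
  debt_ratio = runif(n, min = 0, max = 0.6)
)
score <- sim$income - 80 * sim$debt_ratio + rnorm(n, sd = 8)
sim$approved <- factor(as.integer(score > 35))

train_rows <- sample(seq_len(n), size = 400)
sim_train <- sim[train_rows, ]
sim_test <- sim[-train_rows, ]

B <- 25  # number of bootstrap samples / trees (odd, so votes cannot tie)

# Steps 1 and 2: draw a bootstrap sample and fit one tree to it, B times.
# Each column of `votes` holds one tree's class predictions for the test rows.
votes <- sapply(seq_len(B), function(b) {
  boot_rows <- sample(nrow(sim_train), replace = TRUE)
  fit <- rpart(approved ~ ., data = sim_train[boot_rows, ], method = "class")
  as.character(predict(fit, newdata = sim_test, type = "class"))
})

# Step 3: combine the trees by majority vote across each row of `votes`.
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

bagged_accuracy <- mean(bagged_pred == as.character(sim_test$approved))
bagged_accuracy
```

Each row of votes shows how much the individual trees disagree about one test observation; the majority vote is what smooths that disagreement out.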
Bootstrapping
In machine learning and statistics, a bootstrap sample is a new data set created by randomly sampling observations from the original data set with replacement. “With replacement” means that after an observation is selected, it is put back into the data set before the next draw. As a result, some observations may appear multiple times, and others may not appear at all. For example, suppose the original data set has 5 observations:
A B C D E
One bootstrap sample might look like:
A C C D E
And another might look like:
B B B D A
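Bootstrap samples like these can be drawn with base R’s sample(). The sketch below uses the five-letter example; the seed and the larger size n = 1000 are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed so the draws are reproducible

original <- c("A", "B", "C", "D", "E")

# One bootstrap sample: same size as the original, drawn with replacement.
boot_sample <- sample(original, size = length(original), replace = TRUE)
boot_sample

# Which observations were never drawn in this sample?
setdiff(original, boot_sample)

# For a larger data set, find the share of rows that appear at least once.
n <- 1000
rows <- sample(seq_len(n), size = n, replace = TRUE)
share_included <- length(unique(rows)) / n
share_included
```

In a large data set, a bootstrap sample typically contains roughly 63% of the distinct original observations (the limit is 1 - 1/e), so each tree in a bagged model is trained on a noticeably different subset of the data.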
In R, we can use randomForest() to fit a bagging model.
What is the relationship between B (the number of trees to average over) and the test accuracy?
Run the code below many times. Does test accuracy change much? This should point to the fact that bagging helps lower model variance.
mortgage_train <- mortgage %>% slice_sample(n = 1000)
# number of trees to average over
B <- ____
bagged_model <- randomForest(
  factor(approved) ~ .,
  data = mortgage_train,
  ntree = B,
  mtry = 9 # consider all 9 explanatory variables at every split (this makes it bagging)
)
# Notice bagged_model gives us a confusion matrix:
bagged_model
predictions <- predict(
  bagged_model,
  newdata = mortgage,
  type = "class"
)
# Test accuracy:
mean(predictions == mortgage$approved)
Part 3: Using randomForest to fit a random forest
Random forests are very similar to bagging, with one important difference: at every split, the algorithm only considers a random subset of explanatory variables. This forces different trees to explore different patterns in the data.
A random forest is still made up of decision trees, but now the trees are more diverse. This lowers variance even further and often improves prediction accuracy.
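To illustrate what “a random subset of explanatory variables at every split” means, here is a small base R sketch. The predictor names x1 to x9 are placeholders for the 9 explanatory variables, and the code only mimics the variable-drawing step; it does not grow any trees.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Hypothetical predictor names standing in for the 9 explanatory variables.
predictors <- paste0("x", 1:9)
p <- length(predictors)

# Bagging: every split may consider all p variables.
mtry_bagging <- p

# A common random forest choice for classification: about sqrt(p) variables.
mtry_forest <- floor(sqrt(p))

# Candidate variables at three different splits of one random forest tree.
# Each split draws its own random subset, without replacement.
candidates <- replicate(
  3,
  sample(predictors, size = mtry_forest),
  simplify = FALSE
)
candidates
```

The randomForest() documentation describes a default of about sqrt(p) candidate variables for classification, which is why mtry = 3 is a natural starting point with 9 explanatory variables.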
Run the code below several times with different values for B and C.
What is the relationship between B, C, and prediction accuracy?
How much does prediction accuracy change when you run the code several times holding B and C constant?
Compare random forest accuracy to a single decision tree and also to bagging. What has the highest, most stable accuracy?
mortgage_train <- mortgage %>% slice_sample(n = 1000)
# number of trees to average over
B <- ____
# number of explanatory variables to try at each node of each decision tree. This number must be less than 9.
C <- ____
rf_model <- randomForest(
  factor(approved) ~ .,
  data = mortgage_train,
  ntree = B,
  mtry = C
)
predictions <- predict(
  rf_model,
  newdata = mortgage,
  type = "class"
)
mean(predictions == mortgage$approved)
Part 4: Assessing variable importance in the random forest
One weakness of random forests and bagging is interpretability. Instead of a single tree, we now have many trees averaged together. However, we can still learn something important: which variables seem most useful for prediction in a random forest?
One simple way to measure variable importance is:
- Fit a model normally
- Randomly shuffle the values of one variable (“scramble” it), which breaks its link to the outcome
- See how much prediction accuracy falls
The amount by which accuracy falls is a measure of that variable’s importance.
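The three steps can be wrapped in a small helper. This is a sketch under stated assumptions: scramble_importance() and rule_model() are hypothetical names, the data are simulated, and a fixed rule stands in for a fitted model so the example runs without any extra packages.

```r
# Accuracy drop after shuffling each variable in turn.
# `predict_fn` is any function that maps a data frame to predicted classes.
scramble_importance <- function(predict_fn, data, outcome, vars) {
  baseline <- mean(predict_fn(data) == data[[outcome]])
  drops <- sapply(vars, function(v) {
    scrambled <- data
    scrambled[[v]] <- sample(scrambled[[v]])  # shuffle one column only
    baseline - mean(predict_fn(scrambled) == data[[outcome]])
  })
  sort(drops, decreasing = TRUE)
}

set.seed(123)  # arbitrary seed for reproducibility

# Simulated data (hypothetical): the outcome depends on `signal` only.
n <- 1000
sim <- data.frame(signal = rnorm(n), noise = rnorm(n))
sim$approved <- as.integer(sim$signal > 0)

# Stand-in "model": a fixed rule that uses `signal` and ignores `noise`.
rule_model <- function(d) as.integer(d$signal > 0)

importance <- scramble_importance(
  rule_model, sim, "approved", c("signal", "noise")
)
importance
```

Scrambling noise leaves the predictions unchanged, so its accuracy drop is zero, while scrambling signal destroys most of the rule’s accuracy. With a fitted model, predict_fn could instead wrap predict(), for example function(d) predict(model, newdata = d, type = "class").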
Let’s try this idea.
mortgage_train <- mortgage %>% slice_sample(n = 10000)
rf_model <- randomForest(
  factor(approved) ~ .,
  data = mortgage_train,
  ntree = 100,
  mtry = 3
)
baseline_predictions <- predict(
  rf_model,
  newdata = mortgage,
  type = "class"
)
baseline_accuracy <- mean(
  baseline_predictions == mortgage$approved
)
baseline_accuracy
# Now let's scramble bad_credit:
mortgage_scrambled <- mortgage %>%
  mutate(
    bad_credit = sample(bad_credit)
  )
scrambled_predictions <- predict(
  rf_model,
  newdata = mortgage_scrambled,
  type = "class"
)
# Drop in accuracy caused by scrambling bad_credit:
baseline_accuracy - mean(scrambled_predictions == mortgage$approved)
Repeat the scrambling exercise for:
- debt_to_income_ratio
- income
- loan_amount
- loan_to_value_ratio
- property_value
Which variable seems most important for prediction?
Which variables seem least important?
Why does scrambling an important variable hurt prediction accuracy?
Part 5: Reflection
Explain how a decision tree makes predictions.
Why are decision trees considered high variance models?
How does bagging improve on a single decision tree?
How do random forests improve on bagging?
What does variable importance tell us about a prediction model, and how is it measured?