15 Random Forests

Today, we’ll need two new packages: tree and randomForest:

install.packages("tree")
install.packages("randomForest")
library(tree)
library(randomForest)

Load the mortgage data:

Part 1: Using `tree` to fit a decision tree

The tree() function grows a decision tree by repeatedly searching for splits in the data that best separate the outcome variable. tree() takes as arguments:

A formula. We’ll use approved ~ . to say that approved is our dependent variable, and we want to use all other variables in the data set as explanatory variables.
A data set. We’ll train the decision tree on mortgage_train, a random sample of rows from mortgage, and then assess its accuracy by predicting on the full mortgage data set.
A splitting criterion split. We’ll use "gini" impurity.
A control list, which lets us specify:
- nobs: the number of observations in the training set
- mincut: a stopping criterion for the minimum number of observations to include in either child node. The default is 5.
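To make "best separate the outcome variable" concrete, here is a minimal base R sketch of the textbook Gini impurity and of how a candidate split can be scored with it. The helper names `gini()` and `split_impurity()` and the six made-up observations are assumptions for illustration; tree() may differ in its internal details.

```r
# Gini impurity of a node: 1 minus the sum of squared class proportions.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted average impurity of the two child nodes made by a candidate split.
split_impurity <- function(labels, goes_left) {
  n <- length(labels)
  left <- labels[goes_left]
  right <- labels[!goes_left]
  length(left) / n * gini(left) + length(right) / n * gini(right)
}

approved <- c(1, 1, 1, 0, 0, 0)        # made-up outcomes
income   <- c(90, 80, 70, 40, 30, 20)  # made-up predictor

gini(approved)                         # 0.5: an even mix of the two classes
split_impurity(approved, income > 50)  # 0: this split separates the classes perfectly
split_impurity(approved, income > 75)  # 0.25: one approved loan lands on the wrong side
```

A lower weighted impurity means purer child nodes, so a tree-growing algorithm of this kind would prefer the split at 50 over the split at 75.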

Question #1

Play with mincut and run the code below many times.

When you hold mincut steady, does the tree change in structure a lot or a little? Are decision trees high or low variance models?
What values can mincut take on without giving you an error? How does the structure of the decision tree change when you increase or decrease mincut?
How does the test accuracy of the decision tree change when you change mincut?

mortgage_train <- mortgage %>% slice_sample(n = 1000)

tree(
  formula = factor(approved) ~ .,
  data = mortgage_train,
  split = "gini",
  control = tree.control(
    nobs = nrow(mortgage_train),
    mincut = ____
    )
) %>%
  {
    # Plot the tree
    plot(.)
    text(., pretty = 0)
    
    # Assess the test accuracy of the tree
    predictions <- predict(., newdata = mortgage, type = "class")
    
    mean(predictions == mortgage$approved)
  }
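Note that the chunk above predicts on all of mortgage, which includes the rows the tree was trained on, so the reported accuracy is partly in-sample. If you want a stricter test, one option is to hold out the rows that were not sampled. Here is a minimal base R sketch, where `toy`, `toy_train`, and `toy_test` are hypothetical stand-ins for the mortgage data:

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Made-up stand-in for the mortgage data
toy <- data.frame(
  id = 1:5000,
  approved = rbinom(5000, size = 1, prob = 0.7)
)

# Draw 1000 row numbers for training; every other row is held out for testing
train_rows <- sample(nrow(toy), size = 1000)
toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

nrow(toy_train)  # 1000
nrow(toy_test)   # 4000
```

Predicting on the held-out rows only would then measure accuracy on observations the tree has never seen.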

Decision trees are powerful prediction models because they are easy to interpret, flexible, and capture nonlinear relationships and interactions automatically. But the issue with decision trees is that they are high variance models: different training sets lead to very different decision trees. It’s hard to tell which tree we should trust, especially when they all lead to similar test accuracy.

To combat the variance of the decision tree, we have two tools:

Bagging (“bootstrap aggregating”)
Random Forests

We’ll explore each of these in turn.

Part 2: Using `randomForest` to do bagging

Decision trees have high variance because small changes in the training data can lead to very different trees. Bagging helps solve this problem.

Bagging works by:

Taking many bootstrap samples of the training data (explanation below)
Fitting a decision tree to each bootstrap sample
Averaging the predictions across all trees (for a yes/no outcome like approved, this amounts to a majority vote)

Instead of trusting one unstable tree, bagging combines many unstable trees into one more stable prediction model. The benefit is that variance goes down, but it comes at a cost: when you average across many decision trees, you get only predictions, not a single tree you can plot and interpret.
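For a yes/no outcome such as approved, the averaging step typically works as a majority vote across trees. The sketch below illustrates that idea in base R with a made-up matrix of votes from three hypothetical trees; it is not the internal code of any package.

```r
# Rows are observations, columns are trees; entries are made-up class votes
# (1 = approved, 0 = denied).
votes <- matrix(
  c(1, 1, 0,
    0, 0, 1,
    1, 0, 1,
    0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("obs", 1:4), paste0("tree", 1:3))
)

# Share of trees voting "approved" for each observation
vote_share <- rowMeans(votes)

# Majority vote: predict approval when more than half of the trees say so
bagged_prediction <- as.integer(vote_share > 0.5)
bagged_prediction  # 1 0 1 0
```

Any single tree can be wrong about an observation (tree3 disagrees with the majority on obs1 and obs2), but the combined vote is less sensitive to one tree's quirks.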

Bootstrapping

In machine learning and statistics, a bootstrap sample is a new data set created by randomly sampling observations from the original data set with replacement. “With replacement” means that after an observation is selected, it is put back into the data set before the next draw. As a result, some observations may appear multiple times, and others may not appear at all. For example, suppose the original data set has 5 observations:

A B C D E

One bootstrap sample might look like:

A C C D E

And another might look like:

B B B D A
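A bootstrap sample like the ones above can be drawn in base R with sample() and replace = TRUE. This is a minimal sketch; the seed and the object names are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed so the draw is reproducible

observations <- c("A", "B", "C", "D", "E")

# Draw 5 observations with replacement
boot <- sample(observations, size = length(observations), replace = TRUE)
boot
table(boot)                  # how many times each observation was drawn
setdiff(observations, boot)  # observations that were never drawn

# In a larger data set, what share of rows is left out of a bootstrap sample?
n <- 1000
left_out <- replicate(200, {
  idx <- sample(n, size = n, replace = TRUE)
  1 - length(unique(idx)) / n
})
mean(left_out)  # close to exp(-1), roughly 0.37
```

For a data frame, the same idea applies to row numbers: `mortgage_train[sample(nrow(mortgage_train), replace = TRUE), ]` would give one bootstrap sample of the training data.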

In R, we can use randomForest() to fit a bagging model.

Question #2

What is the relationship between B (the number of trees to average over) and the test accuracy?
Run the code below many times. Does test accuracy change much? This should point to the fact that bagging helps lower model variance.

mortgage_train <- mortgage %>% slice_sample(n = 1000)

# number of trees to average over
B <- ____

bagged_model <- randomForest(
  factor(approved) ~ .,
  data = mortgage_train,
  ntree = B,
  mtry = 9 # number of explanatory variables
)

# Printing bagged_model shows an out-of-bag error estimate and a confusion matrix:
bagged_model

predictions <- predict(
  bagged_model,
  newdata = mortgage,
  type = "class"
)

# Test accuracy:
mean(predictions == mortgage$approved)

Part 3: Using `randomForest` to fit a random forest

Random forests are very similar to bagging, with one important difference: at every split, the algorithm only considers a random subset of explanatory variables. This forces different trees to explore different patterns in the data.

A random forest is still made up of decision trees, but now the trees are more diverse. This lowers variance even further and often improves prediction accuracy.
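The sketch below illustrates only the "random subset of explanatory variables" step, not a full random forest. The predictor names x1 to x9 are hypothetical placeholders for the 9 explanatory variables, and C is the number of variables considered at each split.

```r
set.seed(2)  # arbitrary seed

predictors <- paste0("x", 1:9)  # hypothetical names for 9 explanatory variables
C <- 3                          # number of variables to consider at each split

# At every split, a fresh random subset of C predictors is drawn (without
# replacement), and only those are searched for the best split.
for (split in 1:3) {
  candidates <- sample(predictors, size = C)
  cat("Split", split, "considers:", paste(candidates, collapse = ", "), "\n")
}

# For classification, randomForest() defaults mtry to the floor of the
# square root of the number of predictors.
floor(sqrt(length(predictors)))  # 3
```

Setting C equal to the number of predictors removes the randomness in this step, which is why mtry = 9 gives bagging.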

Question #3

Run the code below several times with different values for B and C.

What is the relationship between B, C, and prediction accuracy?
How much does prediction accuracy change when you run the code several times holding B and C constant?
Compare random forest accuracy to a single decision tree and also to bagging. What has the highest, most stable accuracy?

mortgage_train <- mortgage %>% slice_sample(n = 1000)

# number of trees to average over
B <- ____

# number of explanatory variables to try at each node of each decision tree. This number must be less than 9 (with 9, the model is just bagging).
C <- ____

rf_model <- randomForest(
  factor(approved) ~ .,
  data = mortgage_train,
  ntree = B,
  mtry = C
)

predictions <- predict(
  rf_model,
  newdata = mortgage,
  type = "class"
)

mean(predictions == mortgage$approved)

Part 4: Assessing variable importance in the random forest

One weakness of random forests and bagging is interpretability. Instead of a single tree, we now have many trees averaged together. However, we can still learn something important: which variables seem most useful for prediction in a random forest?

One simple way to measure variable importance is:

Fit a model normally
“Scramble” one variable by randomly shuffling its values across observations
See how much prediction accuracy falls

The amount by which accuracy falls is a measure of that variable’s importance.

Let’s try this idea.

mortgage_train <- mortgage %>% slice_sample(n = 10000)

rf_model <- randomForest(
  factor(approved) ~ .,
  data = mortgage_train,
  ntree = 100,
  mtry = 3
)

baseline_predictions <- predict(
  rf_model,
  newdata = mortgage,
  type = "class"
)

baseline_accuracy <- mean(
  baseline_predictions == mortgage$approved
)

baseline_accuracy

# Now let's scramble bad_credit:
mortgage_scrambled <- mortgage %>%
  mutate(
    bad_credit = sample(bad_credit)
  )

scrambled_predictions <- predict(
  rf_model,
  newdata = mortgage_scrambled,
  type = "class"
)

baseline_accuracy - mean(scrambled_predictions == mortgage$approved)
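Since the same scramble-and-compare steps are repeated for each variable, they can be wrapped in a small function. The helper `accuracy_drop()` below is a hypothetical name, not part of randomForest, and the toy data and prediction rule are made up so the sketch runs on its own.

```r
# Hypothetical helper: how much does accuracy fall when one column is shuffled?
# predict_fn is any function that takes a data frame and returns predictions.
accuracy_drop <- function(predict_fn, data, outcome, variable) {
  baseline <- mean(predict_fn(data) == data[[outcome]])
  scrambled <- data
  scrambled[[variable]] <- sample(scrambled[[variable]])  # shuffle one column
  baseline - mean(predict_fn(scrambled) == data[[outcome]])
}

# Toy check: a made-up "model" that only looks at x1
set.seed(1)
toy <- data.frame(x1 = runif(200), x2 = runif(200))
toy$y <- as.integer(toy$x1 > 0.5)
rule <- function(d) as.integer(d$x1 > 0.5)

accuracy_drop(rule, toy, "y", "x1")  # large drop: the rule depends on x1
accuracy_drop(rule, toy, "y", "x2")  # exactly 0: the rule ignores x2
```

With the objects from this section, a call might look like `accuracy_drop(function(d) predict(rf_model, newdata = d, type = "class"), mortgage, "approved", "income")`. Because the shuffle is random, the drop will vary a little from run to run.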

Question #4

Repeat the scrambling exercise for:

debt_to_income_ratio
income
loan_amount
loan_to_value_ratio
property_value

Which variable seems most important for prediction?
Which variables seem least important?
Why does scrambling an important variable hurt prediction accuracy?

Part 5: Reflection

Question #5

Explain how a decision tree makes predictions.
Why are decision trees considered high variance models?
How does bagging improve on a single decision tree?
How do random forests improve on bagging?
What does variable importance tell us about a prediction model?

