Can’t See the Random Forest for the Decision Trees

Cassie Nutter
Published in Analytics Vidhya
Feb 21, 2021


Photo by Danka & Peter on Unsplash

The saying “can’t see the forest for the trees” is generally used to describe someone who is so focused on a small portion or detail that they miss the big picture. That idiom can be useful in many different scenarios, but it fits especially well when talking about Random Forests and Decision Trees.

It would be impossible to explain Random Forest without discussing Decision Trees.

Diagram showing nodes of a Decision Tree

A Decision Tree is the building block of a Random Forest. A tree is made up of a root node, internal nodes, and leaf nodes. Starting at the root node, the tree makes a series of splits on different features; these split points are the internal nodes. Each internal node tests a condition on a feature (splits are typically chosen to reduce uncertainty, or impurity) and routes the sample down a branch until it reaches a leaf node. A leaf node represents a final class or a predetermined stopping point.
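To make this concrete, here is a minimal sketch of a single Decision Tree built with scikit-learn. The Iris dataset, the Gini criterion, and the depth limit are illustrative choices of mine, not details from this article.

```python
# A minimal sketch of one Decision Tree classifier (illustrative, not from the article).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each internal node tests one feature against a threshold; splits are chosen
# to reduce impurity (here, Gini impurity) until a leaf node is reached.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("Single-tree accuracy:", tree.score(X_test, y_test))
```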

A Decision Tree could be compared to stopping someone on the street and asking them a series of yes-or-no questions. Maybe you want to see if investing in a certain stock is a good idea. You could ask things like, “Is now a good time to buy?” or “Should I diversify?”, until you come to some conclusion. You can imagine this is not a very effective approach. How do you know that one person’s answers are reliable?

This is where Random Forest can be beneficial. Rather than polling one person, what would happen if you asked five people, fifty people, or five hundred? The more people you ask, the more likely you are to receive some accurate and helpful information.

A Random Forest builds a specified number of Decision Trees, tallies their predictions, and accepts the most frequent response as the winner. The intuition is related to the Central Limit Theorem: when many independent estimates are combined, their normalized aggregate settles into a predictable, bell-shaped distribution. A Random Forest relies on the group, rather than any single tree, to identify the correct result.
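As a toy sketch of the tallying step, imagine each tree casting one vote for a class; the class labels below are made up purely for illustration.

```python
# Count (hypothetical) votes from individual trees and take the majority.
from collections import Counter

tree_predictions = ["buy", "sell", "buy", "buy", "hold"]  # one vote per tree

votes = Counter(tree_predictions)
majority_class, n_votes = votes.most_common(1)[0]
print(f"Forest prediction: {majority_class} ({n_votes} of {len(tree_predictions)} votes)")
```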

Diagram of Random Forest classification

Maybe you’re thinking, “Okay. If I can make one Decision Tree or a Random Forest with a bunch of Decision Trees, wouldn’t the Random Forest be filled with identical trees?”

And to that I would say, “How astute you are!”

In fact, that would be true. The Decision Tree algorithm is “greedy”: given the same data, it will make whichever split yields the most information gain at every step, so identical data produces identical trees. To prevent a forest of identical trees, Random Forest uses Bootstrap Aggregation (or “Bagging”) and subspace sampling.

Photo by Siora Photography on Unsplash

Bootstrap Aggregation is used to obtain a portion of the data by sampling with replacement. Each tree is built on a random sample of the training data, drawn with replacement and the same size as the original set. Because rows are chosen randomly and with replacement, some rows appear more than once while others are left out entirely; on average, each tree sees about two-thirds of the unique training rows. The remaining one-third is called the out-of-bag data, and the out-of-bag error is how each tree’s performance is quantified.
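A rough sketch of the bootstrap step, using NumPy and a made-up dataset size, might look like this:

```python
# Draw a bootstrap sample (same size as the data, with replacement) for one tree,
# and treat the rows that were never drawn as that tree's out-of-bag (OOB) sample.
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10

bootstrap_idx = rng.integers(0, n_samples, size=n_samples)   # sampled with replacement
oob_idx = np.setdiff1d(np.arange(n_samples), bootstrap_idx)  # rows left out of this tree

print("Bootstrap sample:", bootstrap_idx)  # some indices repeat
print("Out-of-bag rows:", oob_idx)         # roughly one-third of the rows, on average
```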

The other method used to increase variability in a Random Forest is called subspace sampling. Subspace sampling randomly selects a subset of features that will be used as predictors for each node. Because not all features are used, not all trees will look alike.
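A small sketch of subspace sampling, assuming the common rule of thumb of considering roughly the square root of the total features at each split:

```python
# Pick a random subset of features to consider as split candidates at one node.
import numpy as np

rng = np.random.default_rng(0)
n_features = 9
max_features = int(np.sqrt(n_features))  # e.g. 3 candidate features per split

candidate_features = rng.choice(n_features, size=max_features, replace=False)
print("Features considered at this node:", candidate_features)
```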

Once the whole forest has been created, the algorithm takes the average of the individual trees’ predictions if it is performing Regression, or takes the majority vote if it is performing Classification.
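In scikit-learn, the two aggregation modes correspond to two estimators; the datasets and settings below are placeholders I chose for illustration.

```python
# RandomForestClassifier aggregates by (soft) majority vote;
# RandomForestRegressor aggregates by averaging the trees' predictions.
from sklearn.datasets import load_diabetes, load_iris
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X_cls, y_cls = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_cls, y_cls)
print("Predicted class (vote):", clf.predict(X_cls[:1]))

X_reg, y_reg = load_diabetes(return_X_y=True)
reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_reg, y_reg)
print("Predicted value (average):", reg.predict(X_reg[:1]))
```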

Using a Random Forest has definite advantages over using Decision Trees:

  • Can be used for both Regression and Classification tasks
  • Can handle large datasets with high dimensionality
  • Improves accuracy and resists overfitting, because averaging many trees dampens noise and variance

However, nothing is perfect. Random Forest disadvantages include:

  • Less well suited to Regression tasks, since predictions are averages of the training targets and cannot extrapolate beyond the range seen in training
  • Computationally expensive: a model can take a long time to run and can require a lot of memory, especially with large datasets
  • Each tree is trained independently, so the forest does not learn from a poorly performing tree the way boosting algorithms do

While there are many different algorithms out there for Machine Learning, Random Forests tend to be among the better-performing models. With the help of the many parameters that can be tuned and adjusted, they can outperform more complicated and time-consuming algorithms. Random Forests do a great job of seeing the big picture without focusing in on individual Trees.
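As one example of that tuning, here is a sketch of a small grid search over a few common Random Forest parameters; the grid values are arbitrary examples, not recommendations from this article.

```python
# Tune a few common Random Forest hyperparameters with cross-validated grid search.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {
    "n_estimators": [100, 300],        # number of trees in the forest
    "max_depth": [None, 5, 10],        # how deep each tree may grow
    "max_features": ["sqrt", "log2"],  # subspace sampling size per split
}

search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print("Best parameters:", search.best_params_)
```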

A walk through the woods will never be the same again.

To see projects I have done (including ones using Random Forest), please visit my GitHub.
