Making Predictions with Decision Trees: Regression Trees

Learn how decision trees predict numbers by splitting data.

Predicting Numbers with Decision Trees: Regression Trees

Imagine playing a game of “Guess the Number.” You ask questions like “Is it bigger than 50?” or “Is it even?” to narrow down the possibilities. Decision Trees in machine learning work similarly!

They are flowchart-like structures that ask questions about your data’s features to guide you to a final prediction. A classification tree predicts a category (for example, “Will this customer buy?”), while a regression tree predicts a continuous number (for example, “How many hours will someone play tennis?”). Today, we’ll focus on understanding Regression Trees.

What is a Decision Tree, Anyway?

Splitting Data Based on Questions

A decision tree learns by splitting the dataset into smaller and smaller subsets. At each step, it asks a question about one of the input features (e.g., “Is the weather Outlook Sunny?”). Based on the answer, the data goes down a specific branch.

The main goal when building a tree is to make the resulting groups (at the end of the branches) as “pure” or homogeneous as possible regarding the value we want to predict.

For Regression Trees, “homogeneous” means the numerical target values within a group are very close to each other (low variation).
For Classification Trees, it means most items in a group belong to the same category.

Know Your Tree Parts

Root Node: The top-level node where the first split happens.
Decision Node (Internal Node): A node that asks a question and splits the data further.
Leaf Node (Terminal Node): An end node that doesn’t split anymore. It provides the final prediction.
Subtree: A section of the tree starting from a decision node.
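
To make these parts concrete, here is a minimal sketch of how a tree could be represented in Python. The `Node` class, its field names, the "Windy" feature, and all prediction values are assumptions made up for illustration; real libraries store trees differently.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """One node of a regression tree (hypothetical layout for illustration)."""
    feature: Optional[str] = None        # question asked here; None for a leaf
    children: Dict[str, "Node"] = field(default_factory=dict)  # answer -> subtree
    prediction: Optional[float] = None   # value returned if this node is a leaf

    @property
    def is_leaf(self) -> bool:
        # A leaf (terminal) node has no children, so it does not split further.
        return not self.children


def count_leaves(node: Node) -> int:
    """Count the leaf nodes in the subtree rooted at `node`."""
    if node.is_leaf:
        return 1
    return sum(count_leaves(child) for child in node.children.values())


def depth(node: Node) -> int:
    """Number of edges on the longest path from `node` down to a leaf."""
    if node.is_leaf:
        return 0
    return 1 + max(depth(child) for child in node.children.values())


# The Sunny branch is a subtree: a decision node with its own question.
# All values below are made up.
sunny_subtree = Node(feature="Windy", children={
    "yes": Node(prediction=26.0),
    "no": Node(prediction=47.0),
})

# The root node is the decision node where the first split happens.
root = Node(feature="Outlook", children={
    "Sunny": sunny_subtree,
    "Overcast": Node(prediction=46.0),
    "Rainy": Node(prediction=35.0),
})

print(count_leaves(root), depth(root))
```

Here `root` is the root node, `sunny_subtree` is both a decision node and the top of a subtree, and every `Node` without children is a leaf.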

How Regression Trees Choose the Best Split

Goal: Reduce the Spread (Standard Deviation)

How does the tree decide which question to ask at each step? For regression trees, a common method is to choose the split that results in the biggest reduction in the spread or variation of the target variable.

We often measure this spread using the Standard Deviation (SD). A low SD means the numbers in a group are very similar; a high SD means they are spread out. The tree wants to create groups (leaves) with the lowest possible SD.

Standard Deviation is the square root of the average squared distance of the data points from their mean, so it can be read as a typical distance from the mean.
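
As a rough sketch, the spread of a group can be computed in a few lines of Python. This assumes the population form of the standard deviation (dividing by n rather than n - 1), which the text does not specify. The two lists of values are made up.

```python
import math
import statistics


def population_sd(values):
    """Population SD: square root of the mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


tight = [40, 41, 39, 40]    # made-up values that are close together
spread = [10, 70, 25, 55]   # made-up values that are far apart

print(population_sd(tight))    # small: the group is nearly homogeneous
print(population_sd(spread))   # large: the group is spread out

# The standard library's pstdev computes the same quantity.
print(statistics.pstdev(tight))
```

Both lists have the same mean (40), yet their SDs differ widely, which is exactly the difference a regression tree tries to exploit when it splits.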

The SDR Calculation Steps

At any decision node, the algorithm considers all possible splits:

Measure Current Spread: Calculate the Standard Deviation (SD_Parent) of the target variable for all data points currently in this node.
Test Potential Splits: For every feature:
- Consider splitting based on its values.
- For each potential split, calculate the SD within each resulting child group.
- Calculate the weighted average of the child SDs: Weighted_SD = (Fraction_in_Child1 * SD_Child1) + (Fraction_in_Child2 * SD_Child2) + ..., where each fraction is that child’s share of the node’s data points.
Calculate SDR: For each potential split: SDR = SD_Parent - Weighted_SD
Choose Best Split: Select the feature and split value that yields the Maximum SDR.
Repeat: Apply this process recursively to new child nodes until a stopping condition is met.
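
The steps above can be sketched in Python for the simplest case, where every feature is categorical and a split creates one child per category. The function names, the population SD, and the tiny dataset are assumptions for illustration; numeric features would additionally need candidate thresholds, which this sketch does not cover.

```python
import statistics
from collections import defaultdict


def sd(values):
    """Population standard deviation (0.0 for an empty group)."""
    return statistics.pstdev(values) if values else 0.0


def sdr_for_feature(rows, feature, target):
    """Standard Deviation Reduction from splitting `rows` on one feature."""
    parent_sd = sd([row[target] for row in rows])

    # One child group per distinct value of the feature.
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])

    # Weighted average of child SDs, weighted by each child's share of rows.
    weighted_sd = sum(len(g) / len(rows) * sd(g) for g in groups.values())
    return parent_sd - weighted_sd


def best_split(rows, features, target):
    """Return (feature, sdr) for the feature with the maximum SDR."""
    scores = {f: sdr_for_feature(rows, f, target) for f in features}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Tiny made-up dataset: hours depend strongly on outlook, barely on wind.
rows = [
    {"outlook": "sunny", "windy": "yes", "hours": 20},
    {"outlook": "sunny", "windy": "no", "hours": 22},
    {"outlook": "rainy", "windy": "yes", "hours": 50},
    {"outlook": "rainy", "windy": "no", "hours": 52},
]

feature, sdr = best_split(rows, ["outlook", "windy"], "hours")
print(feature, round(sdr, 2))
```

In this toy data, splitting on "outlook" leaves two tight groups, while splitting on "windy" leaves both children almost as spread out as the parent, so "outlook" gets the higher SDR.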

When to Stop Splitting?

The tree stops growing branches when:

The Standard Deviation in a node is already very low (data is homogeneous).
The node contains too few data points to split further reliably (e.g., fewer than 5 samples).
A pre-set maximum tree depth is reached.
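
These three conditions could be combined into a single check like the sketch below. The function name and the default thresholds (`min_sd`, `min_samples`, `max_depth`) are hypothetical; in practice they are tuning choices rather than fixed values.

```python
import statistics


def should_stop(targets, depth, min_sd=1.0, min_samples=5, max_depth=3):
    """Return True if a node holding `targets` at `depth` should become a leaf."""
    if len(targets) < min_samples:              # too few points to split reliably
        return True
    if depth >= max_depth:                      # pre-set maximum depth reached
        return True
    if statistics.pstdev(targets) <= min_sd:    # already nearly homogeneous
        return True
    return False


# Made-up target values for illustration.
print(should_stop([40, 41, 40, 39, 40, 41], depth=0))   # low spread
print(should_stop([10, 50, 30, 70, 20, 60], depth=0))   # still worth splitting
print(should_stop([10, 50, 30], depth=0))               # too few samples
```

The sample-size check runs first so that the SD is never computed on an empty or tiny group.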

The Prediction in a Leaf Node

Once a data point reaches a leaf node, what’s the prediction? For a regression tree, it’s simple: the prediction is the average (mean) of the target variable for all the training data points that ended up in that leaf.
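
A minimal sketch of that idea, assuming a tree with a single split on Outlook so that each outlook value leads directly to its own leaf. The training rows are made up.

```python
from collections import defaultdict
from statistics import mean

# Made-up training rows: (outlook, hours played)
training = [
    ("Sunny", 25), ("Sunny", 30), ("Sunny", 35),
    ("Overcast", 46), ("Overcast", 44),
    ("Rainy", 50), ("Rainy", 40),
]

# Collect the training targets that end up in each leaf.
leaf_targets = defaultdict(list)
for outlook, hours in training:
    leaf_targets[outlook].append(hours)

# Each leaf stores the mean of the training targets that reached it.
leaf_prediction = {leaf: mean(values) for leaf, values in leaf_targets.items()}


def predict(outlook):
    """Route a new data point to its leaf and return that leaf's stored mean."""
    return leaf_prediction[outlook]


print(predict("Sunny"))
```

Every new data point with the same outlook receives the same prediction, which is why a regression tree's output is a step function rather than a smooth curve.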

Example: Predicting Tennis Hours Played

Suppose at the Root Node (all 14 data points):

Mean Hours = 39.8
SD (Parent) = 9.32

Testing split by ‘Outlook’:

Outlook = Sunny (5 points), SD = 10.87
Outlook = Overcast (4 points), SD = 0
Outlook = Rainy (5 points), SD = 7.78

Weighted Average SD: (5/14 * 10.87) + (4/14 * 0) + (5/14 * 7.78) = 6.66
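
The weighted average can be checked with a few lines of Python, using only the branch sizes and SDs listed above. The variable names are illustrative.

```python
# (count, SD) per Outlook branch, copied from the figures listed above
branches = {"Sunny": (5, 10.87), "Overcast": (4, 0.0), "Rainy": (5, 7.78)}

total = sum(count for count, _ in branches.values())

# Each branch contributes its SD scaled by its share of the data points.
weighted_sd = sum(count / total * sd for count, sd in branches.values())

print(round(weighted_sd, 2))  # 6.66
```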

SDR (Outlook): 9.32 - 6.66 = 2.66

If this is the highest SDR across all features, then ‘Outlook’ is chosen as the first split.

Key Terms Recap

Term	Definition
Decision Node	Where data splits based on a feature’s condition.
Root Node	The very first split/decision node at the top.
Leaf Node	End node with the final prediction (average value).
Subtree	A branch and its subsequent nodes/leaves.
Standard Deviation (SD)	Measures the spread or variation of numerical data.
Standard Deviation Reduction (SDR)	The decrease in SD achieved by a split. Used to choose the best split.

Common Misunderstandings

Regression vs. Classification Prediction: Regression trees predict an average number at the leaves, while classification trees predict a category label.
Best Split ≠ Most Categories: A feature isn’t chosen just because it has many values. It’s chosen if splitting on it reduces the spread of the target variable the most (highest SDR).
Instability: Single decision trees can change significantly with small data changes. Ensemble methods like Random Forests are often more robust.

Summary: Regression Trees

Regression Trees predict continuous numbers.
They work by recursively partitioning data based on input features.
Splits aim to create groups with low variance (low Standard Deviation) in the target variable.
The best split is chosen using Standard Deviation Reduction (SDR).
Splitting stops based on criteria like low variance, minimum sample size, or max depth.
Leaf nodes predict the average of the target variable for samples in that leaf.