
Regression Trees

Decision trees can be generalized beyond classification to handle regression tasks where the goal is predicting numerical values rather than discrete categories.

Previous task: predict whether an animal is a cat (classification).
New task: predict the weight of the animal (regression).

Input Features (unchanged):

  • Ear Shape (pointy/floppy)
  • Face Shape (round/not round)
  • Whiskers (present/absent)

Output Variable:

  • Y = weight in pounds (continuous numerical value)

Regression Task

Predicting a number rather than a category.

Sample regression tree:

  • Root: Split on Ear Shape
  • Left Subtree (Pointy): Split on Face Shape
  • Right Subtree (Floppy): Split on Face Shape

Note: The same feature can appear in multiple branches; this is perfectly valid in decision trees.

Classification vs. Regression Difference:

  • Classification trees: leaf nodes predict a class label, e.g. “Cat” or “Not Cat” (discrete categorical output)
  • Regression trees: leaf nodes predict a number (continuous numerical output)

Leaf Node Calculation (regression): the average of the target values of the training examples reaching that node

Example Leaf Node:

  • Training examples reaching node: Weights [7.2, 8.4, 7.6, 10.2]
  • Prediction: (7.2 + 8.4 + 7.6 + 10.2) ÷ 4 = 8.35 pounds

Additional Examples:

  • Single example node: Weight 9.2 → Prediction: 9.2 pounds
  • Multiple examples: [15.0, 18.0, 20.1] → Prediction: 17.70 pounds
  • Two examples: [9.1, 10.7] → Prediction: 9.90 pounds
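Computationally, the leaf prediction is just an arithmetic mean. A minimal Python sketch (the function name leaf_prediction is only illustrative):

```python
# A regression-tree leaf predicts the mean of the target values of the
# training examples that reach it (values taken from the examples above).
def leaf_prediction(weights):
    return sum(weights) / len(weights)

print(round(leaf_prediction([7.2, 8.4, 7.6, 10.2]), 2))  # 8.35
print(round(leaf_prediction([15.0, 18.0, 20.1]), 2))     # 17.7
print(round(leaf_prediction([9.1, 10.7]), 2))            # 9.9
```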

Classification Trees: Minimize entropy (measure of class impurity)
Regression Trees: Minimize variance (measure of numerical spread)

Variance Definition: A measure of how widely a set of numbers is spread around its mean; here, the sample variance (average of squared deviations from the mean, dividing by n − 1)

Example Comparisons:

  • Low variance: [7.2, 9.2, 8.4, 7.6, 10.2] → Variance = 1.47
  • High variance: [8.8, 15.0, 11.0, 18.0, 20.0] → Variance = 21.87

Interpretation: Higher variance indicates more spread in values, suggesting need for further splitting.
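A short sketch of the variance computation: the figures above are reproduced by the sample variance, dividing by n − 1 (the guard for fewer than two values is an added convenience, not part of the course material):

```python
# Sample variance (divide by n - 1); reproduces the figures in this section.
def variance(values):
    n = len(values)
    if n < 2:
        return 0.0  # a single value has no spread
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

print(round(variance([7.2, 9.2, 8.4, 7.6, 10.2]), 2))     # 1.47
print(round(variance([8.8, 15.0, 11.0, 18.0, 20.0]), 2))  # 21.87
```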

Similar structure to classification, but using variance instead of entropy:

Split Evaluation:

Weighted Variance = w^left × Variance(left) + w^right × Variance(right)

Example: Ear shape split

  • w^left = 5/10, w^right = 5/10
  • Left variance = 1.47, Right variance = 21.87
  • Weighted variance = 0.5 × 1.47 + 0.5 × 21.87 = 11.67
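The same calculation as a sketch, reusing variance() from above. The pointy and floppy lists are the weights assumed for the left and right branches, consistent with the variances quoted in this section:

```python
# Weighted variance of a candidate split: each branch's variance is weighted
# by the fraction of training examples that fall into that branch.
def weighted_variance(left_values, right_values):
    n = len(left_values) + len(right_values)
    w_left, w_right = len(left_values) / n, len(right_values) / n
    return w_left * variance(left_values) + w_right * variance(right_values)

pointy = [7.2, 9.2, 8.4, 7.6, 10.2]     # left branch (pointy ears)
floppy = [8.8, 15.0, 11.0, 18.0, 20.0]  # right branch (floppy ears)
print(round(weighted_variance(pointy, floppy), 2))  # 11.67
```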

Formula:

Variance Reduction = Root Variance - Weighted Variance After Split

Calculation Examples:

Ear Shape Split

  • Root Variance: 20.51
  • Weighted Variance: 11.67
  • Variance Reduction: 8.84
  • Best reduction

Face Shape Split

  • Root Variance: 20.51
  • Weighted Variance: 19.87
  • Variance Reduction: 0.64
  • Minimal improvement

Whiskers Split

  • Root Variance: 20.51
  • Weighted Variance: 14.29
  • Variance Reduction: 6.22
  • Moderate improvement

Selection Rule: Choose feature with largest variance reduction

Example Result: Ear shape (8.84) > Whiskers (6.22) > Face shape (0.64)

Decision: Split on ear shape
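A sketch of the reduction computation and the selection rule, reusing the helpers above. The ear-shape reduction is recomputed from the branch weights; the whiskers and face-shape values are quoted from the figures above, since their branch memberships are not listed here:

```python
# Variance reduction = variance at the node before splitting
#                      minus the weighted variance after the split.
def variance_reduction(root_values, left_values, right_values):
    return variance(root_values) - weighted_variance(left_values, right_values)

all_weights = pointy + floppy  # all 10 examples at the root
print(round(variance_reduction(all_weights, pointy, floppy), 2))  # 8.84 (ear shape)

# Feature selection is an argmax over the candidate reductions.
candidates = {"ear shape": 8.84, "whiskers": 6.22, "face shape": 0.64}
print(max(candidates, key=candidates.get))  # ear shape
```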

After selecting ear shape:

  1. Left Branch: 5 examples with pointy ears
  2. Right Branch: 5 examples with floppy ears
  3. Repeat process: Apply same algorithm to each subset
  4. Continue until stopping criteria met

Similar to classification trees:

  • Maximum depth reached
  • Minimum examples per node
  • Variance reduction below threshold
  • Pure node (all examples have the same target value; rare in practice)
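Putting the pieces together, here is a rough sketch of the recursive build loop with these stopping criteria, reusing variance_reduction() from above. The dict-based tree representation, binary feature encoding, and default thresholds are illustrative choices, not prescribed by the course:

```python
# Recursive build loop with the stopping criteria listed above. `examples` is
# a list of (features, target) pairs, where `features` is a dict of binary
# feature values; thresholds and tree representation are illustrative.
def build_tree(examples, features, depth=0, max_depth=5,
               min_examples=2, min_reduction=0.01):
    targets = [t for _, t in examples]
    mean_target = sum(targets) / len(targets)

    # Stopping criteria: depth limit, too few examples, or a pure node.
    if depth >= max_depth or len(examples) < min_examples or len(set(targets)) == 1:
        return {"leaf": True, "prediction": mean_target}

    # Choose the feature whose split gives the largest variance reduction.
    best = None
    for f in features:
        left = [(x, t) for x, t in examples if x[f]]
        right = [(x, t) for x, t in examples if not x[f]]
        if not left or not right:
            continue  # split sends every example to one side: useless
        reduction = variance_reduction(targets,
                                       [t for _, t in left],
                                       [t for _, t in right])
        if best is None or reduction > best[1]:
            best = (f, reduction, left, right)

    # Stop if no split reduces variance by more than the threshold.
    if best is None or best[1] < min_reduction:
        return {"leaf": True, "prediction": mean_target}

    f, _, left, right = best
    return {"leaf": False, "feature": f,
            "left": build_tree(left, features, depth + 1, max_depth,
                               min_examples, min_reduction),
            "right": build_tree(right, features, depth + 1, max_depth,
                                min_examples, min_reduction)}
```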

Summary: Classification vs. Regression Trees

| Aspect              | Classification Trees       | Regression Trees  |
| ------------------- | -------------------------- | ----------------- |
| Output              | Categories/Classes         | Numerical Values  |
| Leaf Prediction     | Most common class          | Average of values |
| Splitting Criterion | Entropy / Information Gain | Variance Reduction|
| Impurity Measure    | Class mixture              | Value spread      |
Shared aspects of both tree types:

  • Tree building process: Identical recursive approach
  • Feature selection: Choose best split at each node
  • Stopping criteria: Similar threshold-based approaches
  • Prediction process: Follow path from root to leaf
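For completeness, prediction with a tree built by the sketch above is a walk from root to leaf:

```python
# Prediction: follow the path from root to leaf, then return the leaf's
# stored average (same dict-based tree as the build sketch above).
def predict(tree, x):
    node = tree
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] else node["right"]
    return node["prediction"]
```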

Regression trees extend the power of decision trees beyond classification, enabling prediction of continuous numerical outcomes while maintaining the interpretable tree structure and systematic splitting approach.