Decision trees

Spring break detective: did they go to a beach?

Small sample dataset

Beach/ocean photos posted | Got sunburned? | Sand in bag/shoes? | Went to a beach?
0 | no  | no  | no
0 | yes | no  | no
1 | no  | no  | no
1 | no  | yes | yes
2 | yes | no  | yes
2 | no  | yes | yes
3 | no  | no  | yes
4 | yes | yes | yes
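For concreteness, the table can be encoded as a tiny dataset (a sketch; the feature order and 0/1 encoding are my own choice):

```python
# Each row: (beach/ocean photos posted, got sunburned, sand in bag/shoes)
# Booleans encoded as 0/1; label 1 = "went to a beach".
X = [
    (0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 0, 1),
    (2, 1, 0), (2, 0, 1), (3, 0, 0), (4, 1, 1),
]
y = [0, 0, 0, 1, 1, 1, 1, 1]

assert len(X) == len(y) == 8
assert sum(y) == 5  # five beach trips, three stay-at-homes
```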

Build the tree stage by stage

Press the button to add the next split.

Start with the root question. We first ask whether the number of beach/ocean photos posted is at least 1.

Decision Tree: What we want

  • Goal 1: Compactness, use as few nodes as possible while still separating the classes

  • Goal 2: Pure leaves, leaf impurity near 0 means predictions are more reliable

  • Unfortunately, finding the smallest tree that classifies best is NP-hard!

  • Use a greedy approach: on each iteration, pick the feature and split that most reduce impurity

Consistency question: Can we always find a perfectly separating decision tree?

A consistent tree exists exactly when there are no duplicate feature vectors carrying conflicting labels. If identical inputs have different labels, no deterministic tree can classify them both correctly.
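This condition is easy to check programmatically. A minimal sketch, assuming feature vectors are tuples (the function name `tree_consistent` is my own):

```python
def tree_consistent(X, y):
    """Return True iff no two identical feature vectors carry different
    labels, i.e. a perfectly separating decision tree can exist."""
    seen = {}
    for xi, yi in zip(X, y):
        key = tuple(xi)
        if seen.setdefault(key, yi) != yi:
            return False  # same input, conflicting labels
    return True

# Distinct inputs: a consistent tree exists.
assert tree_consistent([(0, 1), (1, 0)], [0, 1]) is True
# Identical inputs with conflicting labels: no deterministic tree works.
assert tree_consistent([(0, 1), (0, 1)], [0, 1]) is False
```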

Impurity Function of a Dataset

Impurity Function

\[ S = \{ (x_1,y_1),\ldots,(x_n,y_n)\}, ~~ \forall i, y_i \in \{1,\ldots,c\} \]

Impurity function \(I(S)\) measures how mixed the labels of the points in \(S\) are.

Each side of a split can have a different impurity. Impurity of a split:

\[ I_T(S)=\frac{|S_L|}{|S|}I(S_L)+\frac{|S_R|}{|S|}I(S_R) \]

Choose the split that minimizes the weighted average impurity of the left and right children where \(S_L=\{(x,y)\in S: x_f\le t\}\) and \(S_R=\{(x,y)\in S:x_f>t\}\).
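The weighted-average formula translates directly into code. A sketch, using Gini impurity (defined formally below) as the plug-in \(I\):

```python
def gini(labels):
    # Gini impurity: 1 - sum_k p_k^2, equivalent to sum_k p_k (1 - p_k).
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(k) / n) ** 2 for k in set(labels))

def split_impurity(S, f, t, impurity):
    """Weighted impurity I_T(S) of splitting S on feature f at threshold t.
    S is a list of (x, y) pairs; impurity maps a label list to a number."""
    left  = [y for x, y in S if x[f] <= t]
    right = [y for x, y in S if x[f] > t]
    n = len(S)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

S = [((0,), 0), ((1,), 0), ((2,), 1), ((3,), 1)]
assert split_impurity(S, 0, 1, gini) == 0.0  # perfect split at t = 1
assert split_impurity(S, 0, 0, gini) > 0.0   # worse split leaves mixing
```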

How do we measure “purity”?

For a subset \(S\), let \(p_k\) be the fraction of examples in class \(k\). That is, \(p_k = \frac{\left|\left\{(x,y) \in S: y = k\right\}\right|}{|S|} \)

Max impurity

\[ I_M(S)= 1 - \max_{k} p_k \]

This is the error rate of predicting the majority label in \(S\), i.e. the Bayes-optimal prediction given only the leaf.

Entropy

\[ I_H(S)=-\sum_{k=1}^{c} p_k\log p_k \]

Entropy is another way to penalize uncertainty. Pure leaves have zero entropy; balanced leaves have larger entropy.

Gini impurity

\[ I_G(S)=\sum_{k=1}^{c} p_k(1-p_k) \]

Binary case intuition: maximum impurity occurs around a 50/50 class mix, and impurity drops to \(0\) at a pure leaf.

Variance

\[ I_V(S)= \mathrm{Var}(y_1,\ldots, y_{|S|}) \]
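All four impurity functions are one-liners over the class fractions \(p_k\). A sketch (helper name `fractions` is my own):

```python
import math

def fractions(labels):
    n = len(labels)
    return [labels.count(k) / n for k in set(labels)]

def max_impurity(labels):   # I_M = 1 - max_k p_k
    return 1.0 - max(fractions(labels))

def entropy(labels):        # I_H = -sum_k p_k log p_k
    return -sum(p * math.log(p) for p in fractions(labels) if p > 0)

def gini(labels):           # I_G = sum_k p_k (1 - p_k)
    return sum(p * (1 - p) for p in fractions(labels))

def variance(labels):       # I_V = Var(y_1, ..., y_|S|)
    mean = sum(labels) / len(labels)
    return sum((y - mean) ** 2 for y in labels) / len(labels)

pure, mixed = [1, 1, 1, 1], [0, 0, 1, 1]
# Pure leaves have zero impurity under every criterion ...
assert max_impurity(pure) == entropy(pure) == gini(pure) == 0.0
# ... and a 50/50 binary mix maximizes each of them.
assert max_impurity(mixed) == 0.5 and gini(mixed) == 0.5
assert abs(entropy(mixed) - math.log(2)) < 1e-12
```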

Interactive demo: build the tree by hand

Propose a split by clicking inside the active region, inspect the weighted average impurity, and then fix the split. Each region displays its current majority-vote class label.

ID3 Algorithm

ID3(S):

  If I(S) is small or can't split anymore:
      return a leaf with the majority label in S

  For each candidate split (f, t):
      S_L = { (x, y) in S : x_f ≤ t }
      S_R = { (x, y) in S : x_f > t }
      Compute I_T(S; f, t) = (|S_L| / |S|) I(S_L) + (|S_R| / |S|) I(S_R)

  Choose the split (f*, t*) with the smallest I_T(S; f*, t*)

  Create an internal node with test:   x_f* ≤ t* ?
  Left child  = ID3(S_L)
  Right child = ID3(S_R)

  return the resulting tree
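The pseudocode above translates almost line for line into Python. A sketch, with Gini as the impurity and feature values themselves as candidate thresholds (both my choices, not fixed by the algorithm):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def id3(S, min_impurity=1e-12):
    """S is a list of (x, y) pairs with x a tuple of numeric features.
    Returns a leaf label or a nested (feature, threshold, left, right) node."""
    labels = [y for _, y in S]
    majority = Counter(labels).most_common(1)[0][0]
    if gini(labels) <= min_impurity:
        return majority  # pure enough: make a leaf

    best = None  # (weighted impurity, f, t, S_L, S_R)
    for f in range(len(S[0][0])):
        for t in sorted({x[f] for x, _ in S}):
            S_L = [(x, y) for x, y in S if x[f] <= t]
            S_R = [(x, y) for x, y in S if x[f] > t]
            if not S_L or not S_R:
                continue  # degenerate split
            w = (len(S_L) * gini([y for _, y in S_L])
                 + len(S_R) * gini([y for _, y in S_R])) / len(S)
            if best is None or w < best[0]:
                best = (w, f, t, S_L, S_R)

    if best is None:
        return majority  # can't split any more
    _, f, t, S_L, S_R = best
    return (f, t, id3(S_L), id3(S_R))

def predict(tree, x):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

S = [((0,), 0), ((1,), 0), ((2,), 1), ((3,), 1)]
tree = id3(S)
assert all(predict(tree, x) == y for x, y in S)
```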

Why not stop when no immediate impurity improvement happens?

Because some useful structures only emerge after multiple coordinated splits. The XOR pattern is the canonical example: no single first split solves it, but a depth-2 tree does.
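A quick numeric check of the XOR claim (a self-contained sketch): every single axis-aligned split leaves the weighted Gini impurity stuck at 0.5, yet two stacked splits classify the data perfectly.

```python
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(k) / n) ** 2 for k in set(labels))

# XOR: label is 1 iff exactly one coordinate is 1.
S = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# No single split improves on the root impurity of 0.5 ...
for f in (0, 1):
    left  = [y for x, y in S if x[f] <= 0]
    right = [y for x, y in S if x[f] > 0]
    w = 0.5 * gini(left) + 0.5 * gini(right)
    assert w == gini([y for _, y in S]) == 0.5

# ... but a depth-2 tree (split on x0, then on x1 in each child) is perfect.
def depth2(x):
    return x[1] if x[0] <= 0 else 1 - x[1]

assert all(depth2(x) == y for x, y in S)
```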

Interactive demo: run ID3 automatically

Choose an impurity criterion, then click a button to run the greedy ID3-style procedure and display its decision boundary on the dataset.

CART: the same structure, but with continuous targets

For regression, leaves predict numeric values. The notes use squared loss as impurity and predict the mean label in each leaf.

Regression-tree impurity

\[ L(S)=\frac{1}{|S|}\sum_{(x,y)\in S}(y-\bar{y}_S)^2 \]

Here \(\bar{y}_S\) is the average label in subset \(S\). A split is good if it creates children whose values are tightly concentrated around their own means.

Prediction at a leaf

\[ \hat{y}(x)=\bar{y}_{\text{leaf}(x)} \]

The learned function becomes piecewise constant: each leaf corresponds to a region with its own average response.
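For a single feature, the squared-loss impurity and the search for the best threshold fit in a few lines. A sketch (`best_split` is my own name):

```python
def sq_loss(ys):
    """L(S): mean squared deviation from the subset mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def best_split(xs, ys):
    """Threshold on a single feature minimizing the weighted squared loss."""
    best_t, best_loss = None, float("inf")
    n = len(xs)
    for t in sorted(set(xs))[:-1]:  # skip the max: right side must be non-empty
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        loss = len(left) / n * sq_loss(left) + len(right) / n * sq_loss(right)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t, best_loss

# A step function: the chosen split lands exactly at the jump,
# and each leaf predicts its own mean with zero loss.
xs = [1, 2, 3, 4, 5, 6]
ys = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0]
t, loss = best_split(xs, ys)
assert t == 3 and loss == 0.0
```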

Classification view

Leaf stores a class label or class histogram.

output = majority label

Regression view

Leaf stores a real number, usually the mean of the training targets in that leaf.

output = local mean

Interactive demo: regression tree

A CART-style regression demo. Adjust depth, leaf size, and noise to see how the piecewise-constant fit changes and how the split locations move.

Overfitting With Decision Trees

  • If we grow to maximum depth, we can fit the training data perfectly, but we will also overfit

  • Mitigation 1: put a constraint on maximum depth

  • Mitigation 2: put a constraint on minimum leaf size

  • Cool mitigation: next lecture, allow maximal depth but ensemble many trees
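Both constraints are single guards in a greedy tree builder. A self-contained sketch (function name `build` and the parameter defaults are my own):

```python
from collections import Counter

def build(S, depth=0, max_depth=3, min_leaf=2):
    """Greedy binary splitter with both overfitting guards:
    stop at max_depth, and never create a child smaller than min_leaf."""
    labels = [y for _, y in S]
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority  # Mitigation 1: depth cap (or already pure)

    def gini(ys):
        n = len(ys)
        return 1.0 - sum((c / n) ** 2 for c in Counter(ys).values())

    best = None
    for f in range(len(S[0][0])):
        for t in sorted({x[f] for x, _ in S}):
            L = [(x, y) for x, y in S if x[f] <= t]
            R = [(x, y) for x, y in S if x[f] > t]
            if len(L) < min_leaf or len(R) < min_leaf:
                continue  # Mitigation 2: minimum leaf size
            w = (len(L) * gini([y for _, y in L])
                 + len(R) * gini([y for _, y in R])) / len(S)
            if best is None or w < best[0]:
                best = (w, f, t, L, R)
    if best is None:
        return majority
    _, f, t, L, R = best
    return (f, t, build(L, depth + 1, max_depth, min_leaf),
                  build(R, depth + 1, max_depth, min_leaf))

S = [((0,), 0), ((1,), 0), ((2,), 1)]
assert build(S, max_depth=0) == 0            # depth cap: majority-vote leaf
assert build(S, min_leaf=1) == (0, 1, 0, 1)  # relaxed leaf size: one split
```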

Interactive demo: depth vs training and validation error

This classification demo sweeps tree depth on a fixed train/validation split and plots how the training and validation errors evolve.

Summary

  • A decision tree is a recursive partition of feature space.

  • Training is greedy: pick the next split that most reduces impurity.

  • Different impurity functions: entropy, Gini, etc.

  • CART extends the same template to regression with squared loss.

  • Can overfit but there are cool mitigations to come soon.
