Decision trees

Spring break detective: did they go to a beach?

Small sample dataset

Beach/ocean photos posted | Got sunburned? | Sand in bag/shoes? | Went to a beach?
0 | no  | no  | no
0 | yes | no  | no
1 | no  | no  | no
1 | no  | yes | yes
2 | yes | no  | yes
2 | no  | yes | yes
3 | no  | no  | yes
4 | yes | yes | yes
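For concreteness, the table can be encoded as a tiny dataset (a sketch; the feature order and 0/1 encoding are my own choice):

```python
# Each row: (beach/ocean photos posted, got sunburned, sand in bag/shoes)
# Booleans encoded as 0/1; label 1 = "went to a beach".
X = [
    (0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 0, 1),
    (2, 1, 0), (2, 0, 1), (3, 0, 0), (4, 1, 1),
]
y = [0, 0, 0, 1, 1, 1, 1, 1]

assert len(X) == len(y) == 8
assert sum(y) == 5  # five beach trips, three stay-at-homes
```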

Build the tree stage by stage

Press the button to add the next split.

Start with the root question. We first ask whether the number of beach/ocean photos posted is at least 1.

Decision Tree: What we want

  • Goal 1: Compactness, use as few nodes as possible while still separating the classes

  • Goal 2: Pure leaves, leaf impurity near 0 means predictions are more reliable

  • Unfortunately, finding the smallest tree that classifies best is NP-hard!

  • Use a greedy approach: on each iteration, pick the feature and split that most reduce impurity

Consistency question: Can we always find a perfectly separating decision tree?

A consistent tree exists exactly when there are no duplicate feature vectors carrying conflicting labels. If identical inputs have different labels, no deterministic tree can classify them both correctly.
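This condition is easy to check programmatically. A minimal sketch, assuming feature vectors are tuples (the function name `tree_consistent` is my own):

```python
def tree_consistent(X, y):
    """Return True iff no two identical feature vectors carry different
    labels, i.e. a perfectly separating decision tree can exist."""
    seen = {}
    for xi, yi in zip(X, y):
        key = tuple(xi)
        if seen.setdefault(key, yi) != yi:
            return False  # same input, conflicting labels
    return True

# Distinct inputs: a consistent tree exists.
assert tree_consistent([(0, 1), (1, 0)], [0, 1]) is True
# Identical inputs with conflicting labels: no deterministic tree works.
assert tree_consistent([(0, 1), (0, 1)], [0, 1]) is False
```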

Impurity Function of a Dataset

Impurity Function

\[ S = \{ (x_1,y_1),\ldots,(x_n,y_n)\}, ~~ \forall i, y_i \in \{1,\ldots,c\} \]

Impurity function \(I(S)\) measures how mixed the labels of the points in \(S\) are.

Each side of a split can have a different impurity. Impurity of a split:

\[ I_T(S)=\frac{|S_L|}{|S|}I(S_L)+\frac{|S_R|}{|S|}I(S_R) \]

Choose the split that minimizes the weighted average impurity of the left and right children where \(S_L=\{(x,y)\in S: x_f\le t\}\) and \(S_R=\{(x,y)\in S:x_f>t\}\).
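The weighted-average formula translates directly into code. A sketch, using Gini impurity (defined formally below) as the plug-in \(I\):

```python
def gini(labels):
    # Gini impurity: 1 - sum_k p_k^2, equivalent to sum_k p_k (1 - p_k).
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(k) / n) ** 2 for k in set(labels))

def split_impurity(S, f, t, impurity):
    """Weighted impurity I_T(S) of splitting S on feature f at threshold t.
    S is a list of (x, y) pairs; impurity maps a label list to a number."""
    left  = [y for x, y in S if x[f] <= t]
    right = [y for x, y in S if x[f] > t]
    n = len(S)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

S = [((0,), 0), ((1,), 0), ((2,), 1), ((3,), 1)]
assert split_impurity(S, 0, 1, gini) == 0.0  # perfect split at t = 1
assert split_impurity(S, 0, 0, gini) > 0.0   # worse split leaves mixing
```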

How do we measure “purity”?

For a subset \(S\), let \(p_k\) be the fraction of examples in class \(k\). That is, \(p_k = \frac{\left|\left\{(x,y) \in S: y = k\right\}\right|}{|S|} \)

Max impurity

\[ I_M(S)= 1 - \max_{k} p_k \]

This is the error rate of predicting the majority label in \(S\), i.e. the Bayes-optimal prediction given only the leaf.

Entropy

\[ I_H(S)=-\sum_{k=1}^{c} p_k\log p_k \]

Entropy is another way to penalize uncertainty. Pure leaves have zero entropy; balanced leaves have larger entropy.

Gini impurity

\[ I_G(S)=\sum_{k=1}^{c} p_k(1-p_k) \]

Binary case intuition: maximum impurity occurs around a 50/50 class mix, and impurity drops to \(0\) at a pure leaf.

Variance

\[ I_V(S)= \mathrm{Var}(y_1,\ldots, y_{|S|}) \]
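All four impurity functions are one-liners over the class fractions \(p_k\). A sketch (helper name `fractions` is my own):

```python
import math

def fractions(labels):
    n = len(labels)
    return [labels.count(k) / n for k in set(labels)]

def max_impurity(labels):   # I_M = 1 - max_k p_k
    return 1.0 - max(fractions(labels))

def entropy(labels):        # I_H = -sum_k p_k log p_k
    return -sum(p * math.log(p) for p in fractions(labels) if p > 0)

def gini(labels):           # I_G = sum_k p_k (1 - p_k)
    return sum(p * (1 - p) for p in fractions(labels))

def variance(labels):       # I_V = Var(y_1, ..., y_|S|)
    mean = sum(labels) / len(labels)
    return sum((y - mean) ** 2 for y in labels) / len(labels)

pure, mixed = [1, 1, 1, 1], [0, 0, 1, 1]
# Pure leaves have zero impurity under every criterion ...
assert max_impurity(pure) == entropy(pure) == gini(pure) == 0.0
# ... and a 50/50 binary mix maximizes each of them.
assert max_impurity(mixed) == 0.5 and gini(mixed) == 0.5
assert abs(entropy(mixed) - math.log(2)) < 1e-12
```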

Interactive demo: build the tree by hand

Propose a split by clicking inside the active region, inspect the weighted average impurity, and then fix the split. Each region displays its current majority-vote class label.

ID3 Algorithm

ID3(S):

  If I(S) is small or can't split anymore:
      return a leaf with the majority label in S

  For each candidate split (f, t):
      S_L = { (x, y) in S : x_f ≤ t }
      S_R = { (x, y) in S : x_f > t }
      Compute I_T(S; f, t) = (|S_L| / |S|) I(S_L) + (|S_R| / |S|) I(S_R)

  Choose the split (f*, t*) with the smallest I_T(S; f*, t*)

  Create an internal node with test:   x_f* ≤ t* ?
  Left child  = ID3(S_L)
  Right child = ID3(S_R)

  return the resulting tree
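The pseudocode above translates almost line for line into Python. A sketch, with Gini as the impurity and feature values themselves as candidate thresholds (both my choices, not fixed by the algorithm):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def id3(S, min_impurity=1e-12):
    """S is a list of (x, y) pairs with x a tuple of numeric features.
    Returns a leaf label or a nested (feature, threshold, left, right) node."""
    labels = [y for _, y in S]
    majority = Counter(labels).most_common(1)[0][0]
    if gini(labels) <= min_impurity:
        return majority  # pure enough: make a leaf

    best = None  # (weighted impurity, f, t, S_L, S_R)
    for f in range(len(S[0][0])):
        for t in sorted({x[f] for x, _ in S}):
            S_L = [(x, y) for x, y in S if x[f] <= t]
            S_R = [(x, y) for x, y in S if x[f] > t]
            if not S_L or not S_R:
                continue  # degenerate split
            w = (len(S_L) * gini([y for _, y in S_L])
                 + len(S_R) * gini([y for _, y in S_R])) / len(S)
            if best is None or w < best[0]:
                best = (w, f, t, S_L, S_R)

    if best is None:
        return majority  # can't split any more
    _, f, t, S_L, S_R = best
    return (f, t, id3(S_L), id3(S_R))

def predict(tree, x):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

S = [((0,), 0), ((1,), 0), ((2,), 1), ((3,), 1)]
tree = id3(S)
assert all(predict(tree, x) == y for x, y in S)
```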

Why not stop when no immediate impurity improvement happens?

Because some useful structures only emerge after multiple coordinated splits. The XOR pattern is the canonical example: no single first split solves it, but a depth-2 tree does.
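A quick numeric check of the XOR claim (a self-contained sketch): every single axis-aligned split leaves the weighted Gini impurity stuck at 0.5, yet two stacked splits classify the data perfectly.

```python
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(k) / n) ** 2 for k in set(labels))

# XOR: label is 1 iff exactly one coordinate is 1.
S = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# No single split improves on the root impurity of 0.5 ...
for f in (0, 1):
    left  = [y for x, y in S if x[f] <= 0]
    right = [y for x, y in S if x[f] > 0]
    w = 0.5 * gini(left) + 0.5 * gini(right)
    assert w == gini([y for _, y in S]) == 0.5

# ... but a depth-2 tree (split on x0, then on x1 in each child) is perfect.
def depth2(x):
    return x[1] if x[0] <= 0 else 1 - x[1]

assert all(depth2(x) == y for x, y in S)
```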

Interactive demo: run ID3 automatically

Choose an impurity criterion, then click a button to run the greedy ID3-style procedure and display its decision boundary on the dataset.

CART: the same structure, but with continuous targets

For regression, leaves predict numeric values. The notes use squared loss as impurity and predict the mean label in each leaf.

Regression-tree impurity

\[ L(S)=\frac{1}{|S|}\sum_{(x,y)\in S}(y-\bar{y}_S)^2 \]

Here \(\bar{y}_S\) is the average label in subset \(S\). A split is good if it creates children whose values are tightly concentrated around their own means.

Prediction at a leaf

\[ \hat{y}(x)=\bar{y}_{\text{leaf}(x)} \]

The learned function becomes piecewise constant: each leaf corresponds to a region with its own average response.
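For a single feature, the squared-loss impurity and the search for the best threshold fit in a few lines. A sketch (`best_split` is my own name):

```python
def sq_loss(ys):
    """L(S): mean squared deviation from the subset mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def best_split(xs, ys):
    """Threshold on a single feature minimizing the weighted squared loss."""
    best_t, best_loss = None, float("inf")
    n = len(xs)
    for t in sorted(set(xs))[:-1]:  # skip the max: right side must be non-empty
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        loss = len(left) / n * sq_loss(left) + len(right) / n * sq_loss(right)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t, best_loss

# A step function: the chosen split lands exactly at the jump,
# and each leaf predicts its own mean with zero loss.
xs = [1, 2, 3, 4, 5, 6]
ys = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0]
t, loss = best_split(xs, ys)
assert t == 3 and loss == 0.0
```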

Classification view

Leaf stores a class label or class histogram.

output = majority label

Regression view

Leaf stores a real number, usually the mean of the training targets in that leaf.

output = local mean

Interactive demo: regression tree

A CART-style regression demo. Adjust depth, leaf size, and noise to see how the piecewise-constant fit changes and how the split locations move.

Overfitting With Decision Trees

  • If we grow to maximum depth, we can fit the training data perfectly, but we will also overfit

  • Mitigation 1: put a constraint on maximum depth

  • Mitigation 2: put a constraint on minimum leaf size

  • Cool mitigation: next lecture, allow maximal depth but ensemble many trees
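Both constraints are single guards in a greedy tree builder. A self-contained sketch (function name `build` and the parameter defaults are my own):

```python
from collections import Counter

def build(S, depth=0, max_depth=3, min_leaf=2):
    """Greedy binary splitter with both overfitting guards:
    stop at max_depth, and never create a child smaller than min_leaf."""
    labels = [y for _, y in S]
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority  # Mitigation 1: depth cap (or already pure)

    def gini(ys):
        n = len(ys)
        return 1.0 - sum((c / n) ** 2 for c in Counter(ys).values())

    best = None
    for f in range(len(S[0][0])):
        for t in sorted({x[f] for x, _ in S}):
            L = [(x, y) for x, y in S if x[f] <= t]
            R = [(x, y) for x, y in S if x[f] > t]
            if len(L) < min_leaf or len(R) < min_leaf:
                continue  # Mitigation 2: minimum leaf size
            w = (len(L) * gini([y for _, y in L])
                 + len(R) * gini([y for _, y in R])) / len(S)
            if best is None or w < best[0]:
                best = (w, f, t, L, R)
    if best is None:
        return majority
    _, f, t, L, R = best
    return (f, t, build(L, depth + 1, max_depth, min_leaf),
                  build(R, depth + 1, max_depth, min_leaf))

S = [((0,), 0), ((1,), 0), ((2,), 1)]
assert build(S, max_depth=0) == 0            # depth cap: majority-vote leaf
assert build(S, min_leaf=1) == (0, 1, 0, 1)  # relaxed leaf size: one split
```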

Interactive demo: depth vs training and validation error

This classification demo sweeps tree depth on a fixed train/validation split and plots how the training and validation errors evolve.

Summary

  • A decision tree is a recursive partition of feature space.

  • Training is greedy: pick the next split that most reduces impurity.

  • Different impurity functions: entropy, Gini, etc.

  • CART extends the same template to regression with squared loss.

  • Can overfit but there are cool mitigations to come soon.
