| Beach/ocean photos posted | Got sunburned? | Sand in bag/shoes? | Went to a beach? |
|---|---|---|---|
| 0 | no | no | no |
| 0 | yes | no | no |
| 1 | no | no | no |
| 1 | no | yes | yes |
| 2 | yes | no | yes |
| 2 | no | yes | yes |
| 3 | no | no | yes |
| 4 | yes | yes | yes |
Press the button to add the next split.
Goal 1: Compact. Use as few nodes as possible while still separating the classes.
Goal 2: Pure leaves. Leaf impurity near 0 means predictions are more reliable.
Unfortunately, finding the smallest tree that classifies the data correctly is NP-hard!
Use a greedy approach: on each iteration, pick the feature and split that most reduces impurity.
A consistent tree exists exactly when there are no duplicate feature vectors carrying conflicting labels. If identical inputs have different labels, no deterministic tree can classify them both correctly.
An impurity function \(I(S)\) measures how mixed the labels of the points in \(S\) are.
Choose the split that minimizes the weighted average impurity of the left and right children where \(S_L=\{(x,y)\in S: x_f\le t\}\) and \(S_R=\{(x,y)\in S:x_f>t\}\).
For a subset \(S\), let \(p_k\) be the fraction of examples in class \(k\). That is, \(p_k = \frac{\left|\left\{(x,y) \in S: y = k\right\}\right|}{|S|} \)
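These fractions \(p_k\) plug directly into the impurity criteria discussed below. A minimal sketch (function names are my own, not from the notes):

```python
from collections import Counter
from math import log2

def class_fractions(labels):
    """p_k: the fraction of examples in each class k."""
    n = len(labels)
    return {k: c / n for k, c in Counter(labels).items()}

def misclassification(labels):
    """1 - max_k p_k: error of the majority-vote (Bayes-optimal) prediction."""
    return 1 - max(class_fractions(labels).values())

def entropy(labels):
    """-sum_k p_k log2 p_k: zero for a pure leaf, 1 bit for a 50/50 binary mix."""
    return -sum(p * log2(p) for p in class_fractions(labels).values())

def gini(labels):
    """sum_k p_k (1 - p_k): another smooth impurity criterion."""
    return sum(p * (1 - p) for p in class_fractions(labels).values())

labels = ["yes", "yes", "no", "no"]  # 50/50 mix: maximally impure
print(misclassification(labels))     # 0.5
print(entropy(labels))               # 1.0
print(gini(labels))                  # 0.5
```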
Misclassification error: \(I(S) = 1 - \max_k p_k\), the error rate of the Bayes-optimal (majority-vote) prediction within the leaf.
Entropy, \(I(S) = -\sum_k p_k \log_2 p_k\), is another way to penalize uncertainty. Pure leaves have zero entropy; balanced leaves have maximal entropy.
Binary case intuition: maximum impurity occurs around a 50/50 class mix, and impurity drops to \(0\) at a pure leaf.
Propose a split by clicking inside the active region, inspect the weighted average impurity, and then fix the split. Each region displays its current majority-vote class label.
ID3(S):
    If I(S) is small or we can't split anymore:
        return a leaf with the majority label in S
    For each candidate split (f, t):
        S_L = { (x, y) in S : x_f ≤ t }
        S_R = { (x, y) in S : x_f > t }
        Compute I_T(S; f, t) = (|S_L| / |S|) I(S_L) + (|S_R| / |S|) I(S_R)
    Choose the split (f*, t*) with the smallest I_T(S; f*, t*)
    Create an internal node with the test: x_f* ≤ t* ?
        Left child = ID3(S_L)
        Right child = ID3(S_R)
    return the resulting tree
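The pseudocode above can be sketched as runnable Python. The data layout (lists of feature tuples) and helper names (`majority`, `predict`) are my own choices, not from the notes; entropy plays the role of \(I(S)\). The usage at the bottom encodes the beach table from the top of the notes, with yes/no mapped to 1/0:

```python
from collections import Counter
from math import log2

def entropy(ys):
    n = len(ys)
    return -sum((c / n) * log2(c / n) for c in Counter(ys).values())

def majority(ys):
    return Counter(ys).most_common(1)[0][0]

def id3(X, y, min_impurity=1e-9):
    if entropy(y) <= min_impurity:              # "I(S) is small"
        return ("leaf", majority(y))
    best = None
    for f in range(len(X[0])):                  # every feature f ...
        for t in sorted({x[f] for x in X}):     # ... and threshold t seen in the data
            L = [i for i, x in enumerate(X) if x[f] <= t]
            R = [i for i, x in enumerate(X) if x[f] > t]
            if not L or not R:
                continue
            yL, yR = [y[i] for i in L], [y[i] for i in R]
            score = (len(L) * entropy(yL) + len(R) * entropy(yR)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, L, R)
    if best is None:                            # "can't split anymore"
        return ("leaf", majority(y))
    _, f, t, L, R = best
    return ("node", f, t,
            id3([X[i] for i in L], [y[i] for i in L], min_impurity),
            id3([X[i] for i in R], [y[i] for i in R], min_impurity))

def predict(tree, x):
    while tree[0] == "node":
        _, f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree[1]

# The beach data: (photos, sunburned, sand in bag/shoes) -> went to a beach?
X = [[0,0,0],[0,1,0],[1,0,0],[1,0,1],[2,1,0],[2,0,1],[3,0,0],[4,1,1]]
y = ["no","no","no","yes","yes","yes","yes","yes"]
tree = id3(X, y)
print(all(predict(tree, x) == lab for x, lab in zip(X, y)))  # True: no conflicting duplicates
```

Because no two identical feature vectors carry different labels, the recursion can always reach pure leaves, so the learned tree fits the training table exactly.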
Because some useful structures only emerge after multiple coordinated splits. The XOR pattern is the canonical example: no single first split solves it, but a depth-2 tree does.
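As a check, here is a small script (my own illustration, not from the notes) showing that on the four XOR points, no single axis-aligned split improves the misclassification impurity:

```python
# XOR: with misclassification error as impurity, every single axis-aligned
# split on the 4 XOR points leaves the weighted impurity at its starting 0.5.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]                      # label = x1 XOR x2

def err(ys):                          # misclassification impurity 1 - max_k p_k
    return 1 - max(ys.count(c) for c in set(ys)) / len(ys)

for f in (0, 1):
    L = [y[i] for i, x in enumerate(X) if x[f] <= 0]
    R = [y[i] for i, x in enumerate(X) if x[f] > 0]
    weighted = (len(L) * err(L) + len(R) * err(R)) / len(y)
    print(f"split on x{f + 1}: weighted impurity {weighted}")  # 0.5 both times
# Yet splitting first on x1, then on x2 inside each half, yields four pure leaves.
```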
Choose an impurity criterion, then click a button to run the greedy ID3-style procedure and display its decision boundary on the dataset.
For regression, leaves predict numeric values. The notes use squared loss as the impurity, \(I(S) = \frac{1}{|S|}\sum_{(x,y)\in S}(y-\bar{y}_S)^2\), and predict the mean label in each leaf.
Here \(\bar{y}_S\) is the average label in subset \(S\). A split is good if it creates children whose values are tightly concentrated around their own means.
The learned function becomes piecewise constant: each leaf corresponds to a region with its own average response.
Classification: a leaf stores a class label or a class histogram.
Regression: a leaf stores a real number, usually the mean of the training targets in that leaf.
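A minimal sketch of scoring a regression split under squared loss (the helper names `sq_impurity` and `split_score` are my own):

```python
# Squared-loss impurity for regression: I(S) is the mean squared deviation from
# the leaf mean, so a leaf predicting ybar_S incurs exactly I(S) on its points.
def sq_impurity(ys):
    ybar = sum(ys) / len(ys)
    return sum((y - ybar) ** 2 for y in ys) / len(ys)

def split_score(ys_left, ys_right):
    """Weighted average impurity of the two children."""
    n = len(ys_left) + len(ys_right)
    return (len(ys_left) * sq_impurity(ys_left)
            + len(ys_right) * sq_impurity(ys_right)) / n

ys = [1.0, 1.2, 5.0, 5.2]
print(sq_impurity(ys))                        # large: values far from the overall mean
print(split_score([1.0, 1.2], [5.0, 5.2]))    # ~0.01: each child tight around its mean
```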
A CART-style regression demo. Adjust depth, leaf size, and noise to see how the piecewise-constant fit changes and how the split locations move.
If we grow to maximum depth we can fit the training data perfectly, but we will also overfit.
Mitigation 1: put a constraint on the maximum depth.
Mitigation 2: put a constraint on the minimum leaf size.
Cool mitigation: next lecture, allow maximum depth but ensemble many trees.
This classification demo sweeps tree depth on a fixed train/validation split and plots how the training and validation errors evolve.
A decision tree is a recursive partition of feature space.
Training is greedy: at each step, pick the split that most reduces the weighted impurity.
Different impurity functions: entropy, Gini, misclassification error, etc.
CART extends the same template to regression with squared loss.
Trees can overfit, but there are cool mitigations to come soon.